marbl / CHM13

The complete sequence of a human genome

Extracting a subset of data from raw nanopore signal data #63

Open hasindu2008 opened 1 year ago

hasindu2008 commented 1 year ago

I am looking for an ONT raw signal dataset at very high coverage (a few hundred X), and the nanopore dataset in this repository seems ideal. I only need the raw data for a few genomic regions. Is there a way to selectively download a set of read IDs from the raw dataset without having to download and extract all the terabytes of tar.gz archives (which I estimate would take weeks to months)?

skoren commented 1 year ago

Unfortunately, we don't have the data organized by chromosome, so your only option would be to download and extract the full set. If you have the IDs of the reads you're interested in and post them here, I can try to look up which partitions they are in, and you can download just those.
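For illustration, the partition lookup described above could be sketched as follows. This is only a sketch under stated assumptions: it presumes a tab-separated index of `read_id<TAB>partition` lines exists, which this thread does not confirm, and the read IDs and partition names are made up.

```python
from collections import defaultdict


def partitions_for_reads(index_lines, wanted_ids):
    """Return ({partition: [read IDs]}, set of wanted IDs not found).

    index_lines: iterable of "read_id<TAB>partition" strings. This index
    format is hypothetical; the dataset may not ship anything like it.
    """
    wanted = set(wanted_ids)
    hits = defaultdict(list)
    for line in index_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        read_id, partition = line.split("\t")[:2]
        if read_id in wanted:
            hits[partition].append(read_id)
    found = {rid for ids in hits.values() for rid in ids}
    return dict(hits), wanted - found


# Toy index; real read IDs and partition names would come from the dataset.
toy_index = [
    "read-a\tpartition_01",
    "read-b\tpartition_02",
    "read-c\tpartition_01",
]
hits, missing = partitions_for_reads(toy_index, ["read-a", "read-c", "read-x"])
print(sorted(hits))     # partitions that would need downloading
print(sorted(missing))  # requested reads absent from the index
```

If the wanted reads turn out to be spread across most partitions, a lookup like this saves little, since nearly every archive still has to be downloaded.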

hasindu2008 commented 1 year ago

As the reads seemed to be distributed throughout the partitions (and I would have had to iteratively try different subsets), I ended up downloading the whole thing, and after about 2 weeks it has fully downloaded! I am now extracting everything and hope the file system can handle such a large number of files. I will let you know how it goes. This is an exciting dataset.

gringer commented 1 year ago

It'd be really useful to have fast5 files sorted by chromosome/position. That'd be a lot of effort to set up, though.

hasindu2008 commented 1 year ago

@gringer While the data is in FAST5, yes, every manipulation task is hard. I recently converted all the partitions into BLOW5 successfully, and any type of sorting is now a few bash commands. I can provide such a sorted version if you are interested.
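As a rough illustration of sorting by chromosome and position, the sketch below orders read IDs by where they align. It is not the commands referred to above. It assumes the basecalled reads have already been aligned and the alignments are available as PAF lines (column 1 is the read name, column 6 the target name, column 8 the target start); the toy records are invented. The resulting ID list could then be handed to whatever tool extracts the raw signal records.

```python
def sorted_read_ids(paf_lines):
    """Return read IDs ordered by (target name, target start).

    Each read appears once. If a read has several alignments, the first one
    seen is kept, which is an arbitrary choice made for this sketch.
    Target names sort lexicographically, so "chr10" comes before "chr2".
    """
    first_hit = {}
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip empty or malformed lines (PAF has 12 mandatory columns)
        read_id, target, start = fields[0], fields[5], int(fields[7])
        first_hit.setdefault(read_id, (target, start))
    return sorted(first_hit, key=lambda rid: first_hit[rid])


# Toy PAF records with made-up read names and coordinates.
toy_paf = [
    "r1\t1000\t0\t1000\t+\tchr2\t5000\t300\t1300\t900\t1000\t60",
    "r2\t800\t0\t800\t-\tchr1\t5000\t2000\t2800\t700\t800\t60",
    "r3\t900\t0\t900\t+\tchr1\t5000\t100\t1000\t850\t900\t60",
]
order = sorted_read_ids(toy_paf)
print("\n".join(order))
```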

@skoren Do you have the total number of reads in the dataset? After conversion to BLOW5, the total size dropped to 3.4 TB from the original 5.2 TB of compressed FAST5 tar.gz archives. I would like to use the read count to double-check that all the reads are present in the converted version.
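One way to picture the check above, assuming read ID lists can be exported from both the original and the converted data (how to export them is not covered here, and the IDs below are invented), is to compare the ID sets and not only the totals:

```python
from collections import Counter


def compare_read_ids(original_ids, converted_ids):
    """Summarise differences between two collections of read IDs."""
    orig = Counter(original_ids)
    conv = Counter(converted_ids)
    return {
        "original_total": sum(orig.values()),
        "converted_total": sum(conv.values()),
        # IDs in the original that never appear after conversion
        "missing": sorted(set(orig) - set(conv)),
        # IDs that appear only after conversion
        "unexpected": sorted(set(conv) - set(orig)),
        # IDs that occur more than once after conversion
        "duplicated": sorted(rid for rid, n in conv.items() if n > 1),
    }


# Toy example: the totals match, yet one read is missing and one is duplicated.
report = compare_read_ids(["a", "b", "c"], ["a", "c", "c"])
for key, value in report.items():
    print(key, value)
```

The toy case shows why a matching total alone is weak evidence: a dropped read and a duplicated read cancel out in the count.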

Marynotmartha commented 1 year ago

Ask me for my raw DNA.
