nanoporetech / pod5-file-format

Pod5: a high performance file format for nanopore reads.
https://pod5-file-format.readthedocs.io/

pod5 doesn't seem to save much space on short reads #66

Open billytcl opened 1 year ago

billytcl commented 1 year ago

I have some fast5s of short reads (~170bp) that total 1.7T:

billylau@suzuki:/mnt/ix1/Seq_Runs/20230721_PRM_1377/Seq_Output$ find . -iname '*.fast5' -print0 | du -ch --files0-from=- | tail
20M     ./20230721_1655_3E_PAQ68008_9290fd40/fast5_pass/barcode11/PAQ68008_pass_barcode11_9290fd40_633ff27b_61.fast5
20M     ./20230721_1655_3E_PAQ68008_9290fd40/fast5_pass/barcode11/PAQ68008_pass_barcode11_9290fd40_633ff27b_35.fast5
19M     ./20230721_1655_3E_PAQ68008_9290fd40/fast5_pass/barcode11/PAQ68008_pass_barcode11_9290fd40_633ff27b_14.fast5
20M     ./20230721_1655_3E_PAQ68008_9290fd40/fast5_pass/barcode11/PAQ68008_pass_barcode11_9290fd40_633ff27b_86.fast5
19M     ./20230721_1655_3E_PAQ68008_9290fd40/fast5_pass/barcode11/PAQ68008_pass_barcode11_9290fd40_633ff27b_2.fast5
20M     ./20230721_1655_3E_PAQ68008_9290fd40/fast5_pass/barcode11/PAQ68008_pass_barcode11_9290fd40_633ff27b_18.fast5
20M     ./20230721_1655_3E_PAQ68008_9290fd40/fast5_pass/barcode11/PAQ68008_pass_barcode11_9290fd40_633ff27b_77.fast5
20M     ./20230721_1655_3E_PAQ68008_9290fd40/fast5_pass/barcode11/PAQ68008_pass_barcode11_9290fd40_633ff27b_56.fast5
20M     ./20230721_1655_3E_PAQ68008_9290fd40/fast5_pass/barcode11/PAQ68008_pass_barcode11_9290fd40_633ff27b_39.fast5
1.7T    total

When I convert it to a pod5 file, the file size barely changes, although I thought smaller files were one of the main advantages over fast5:

billylau@suzuki:/mnt/ix1/Seq_Runs/20230721_PRM_1377/pod5$ ll -h
total 1.5T
drwxrwxr-x 2 billylau jiseqruns    4 Aug 18 22:01 ./
drwxrwxr-x 7 billylau jiseqruns   13 Aug 25 14:50 ../
-rw------- 1 billylau jiseqruns 6.4M Aug 18 22:01 nohup.out
-rw-rw-r-- 1 billylau jiseqruns 1.5T Aug 18 22:01 output.pod5

Am I doing anything wrong here? The faster performance is nice, but having a single file that large is a little scary, especially if it gets corrupted.

sklages commented 1 year ago

We usually write pod5 in MinKNOW, currently still in 4K batches (the default), just like fast5.

We always convert the raw data files (either fast5/pod5) into a single large pod5 file for immediate basecalling. This process is quite fast.

After basecalling this single pod5 is removed. We always keep the original files (written by minknow) for long-term storage.

AFAICS the single pod5 is just about the size of the run data, at least for the data I have worked with here so far.
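The convert-then-basecall workflow described above can be sketched roughly as follows. This is a sketch assuming the `pod5` CLI from the pod5 Python package (`pip install pod5`) is installed; subcommand and flag names should be double-checked against `pod5 --help`:

```shell
set -u
mkdir -p pod5_tmp

if command -v pod5 >/dev/null; then
    # Convert MinKNOW's per-batch fast5 output into one large pod5 file.
    pod5 convert fast5 fast5_pass/*.fast5 --output pod5_tmp/run.pod5

    # If the run already produced many small pod5 files, merge them instead:
    # pod5 merge pod5_pass/*.pod5 --output pod5_tmp/run.pod5
fi

# Basecall the single file, then delete it; keep the original MinKNOW
# output for long-term storage:
#   dorado basecaller hac pod5_tmp/run.pod5 > calls.bam
#   rm pod5_tmp/run.pod5
```

The single merged file is treated as a disposable intermediate here, which sidesteps the archiving question entirely: only the original per-batch files are kept.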

billytcl commented 1 year ago

Is there a big performance difference between a single pod5 and a bunch of individual pod5 files? I may just do the one-to-one conversion instead.


sklages commented 1 year ago

I haven't benchmarked it, but generally I'd say reading from one big file is more efficient than reading a few hundred or even a few thousand files. With many small files you may run into I/O issues, so the GPU may be loaded suboptimally because it does not receive data fast enough.

I have different sources of sequencing data, so converting/merging whatever I get into a single pod5 file means dorado always starts with the same input type. Converting/merging is quite fast, even for large datasets. So if basecalling takes 30h instead of 29h .. well, I don't care ;-)

Is there any special reason why you want to convert one-by-one?

billytcl commented 1 year ago

I’m just scared of a single 2TB file being corrupted, and we are thinking of ways to archive a bunch of our old runs that were pre-pod5! With the number of reads involved, it actually takes 8-17h to convert the fast5s to a single pod5.
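One way to take some of the sting out of a single huge archive file (a generic sketch, not pod5-specific) is to record a checksum at creation time, so corruption is at least detectable before you rely on the copy. A small stand-in file is used here in place of the real 2TB archive:

```shell
# Record a checksum when the archive is written, so silent corruption can
# be detected later. `demo.pod5` is a tiny stand-in for the real archive.
printf 'demo archive contents\n' > demo.pod5

sha256sum demo.pod5 > demo.pod5.sha256      # store alongside the archive

# Later (e.g. after copying to tape or NFS), verify the copy:
sha256sum --check demo.pod5.sha256          # prints: demo.pod5: OK
```

Splitting the archive per barcode or per flowcell (rather than one monolithic file) similarly limits the blast radius of any single corrupted file.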


sklages commented 1 year ago

I just converted a small fast5 dataset (from 2020), ~1000 files, 650GB, in 40min. The data was read from an NFS mount .. it may be even faster from local storage.

We are also thinking about archiving old run data .. we will probably go for per-flowcell pod5 files.

billytcl commented 1 year ago

There could be a lot of per-read overhead in our dataset, considering the reads are only ~160bp! We are also converting from an NFS mount.


0x55555555 commented 1 year ago

Hi @billytcl,

I don't believe you're doing anything wrong - it looks like the dataset shrinks to about 0.88x its original size (roughly a 12% reduction). I agree I have seen more significant reductions in the past. It will depend on the compression of the original fast5 dataset and on the length of the reads.

I’m just scared of a single 2TB file being corrupted.

I think that's fair - it is a significant amount of data, and converting that much from fast5 can take some time. I'm not sure I can recommend one massive file over several smaller ones for archiving; it likely depends on your storage system and backup processes.

If you are open to sharing some of the source/destination data, I can investigate why more space wasn't saved in the conversion. It might also be worth estimating the number of bytes used per read across the whole dataset - does that number seem reasonable? It might hint that either the fast5 dataset is smaller than expected, or the pod5 is larger.
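A back-of-the-envelope version of that per-read estimate might look like this. All parameters are illustrative assumptions on my part (typical MinION sampling rates and int16 signal storage), not measured values from this dataset:

```python
# Rough per-read byte budget for short nanopore reads.
# All parameters below are illustrative assumptions, not measured values.

read_len_bases = 170      # ~170bp reads, as in the original post
samples_per_base = 10     # ~4 kHz sampling / ~400 bases/s translocation
bytes_per_sample = 2      # raw signal stored as int16

raw_signal_bytes = read_len_bases * samples_per_base * bytes_per_sample
print(raw_signal_bytes)   # 3400 bytes of uncompressed signal per read

# With only ~3.4 kB of signal per read, fixed per-read costs (read id,
# metadata, compression framing) are a much larger fraction of each record
# than they are for long reads, so format-level savings shrink.
total_bytes = 1.5e12      # the ~1.5 TB pod5 from this thread
est_reads = total_bytes / raw_signal_bytes
print(f"{est_reads:.2e}") # ~4.4e+08 reads if signal dominated the file
```

If the actual read count is far below that estimate, per-read overhead (rather than signal compression) is likely what is inflating the file.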

Thanks,