marbl / CHM13

The complete sequence of a human genome

Duplicate fast5 downloads #9

tmassingham-ont closed this issue 4 years ago

tmassingham-ont commented 4 years ago

Hello and many thanks for sharing your data.

I'm currently rebasecalling the data using the latest methods and noticed that many of the fast5 downloads are duplicates of other partitions. Are there reads missing and, if so, is it possible to obtain them please?

I've confirmed the duplication by unpacking the files and comparing the reads. It's curious that the duplicate files have a different md5sum from the originals; presumably the order in which the reads were packed into each file was not deterministic.
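For anyone wanting to repeat this kind of check, here is a minimal sketch of an order-independent comparison. It is not the procedure used above. The function name `tgz_fingerprint` is hypothetical, and the sketch assumes GNU `md5sum` is available and that duplicated partitions contain byte-identical fast5 files, which this thread does not confirm. It extracts to a temporary directory, so it needs as much free disk space as the unpacked partition.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a fingerprint of a .tgz archive that does not
# depend on the order in which members were packed.
# Assumes GNU md5sum (on macOS, substitute a suitable md5 command).
tgz_fingerprint() {
    local archive=$1 tmp
    tmp=$(mktemp -d) || return 1
    tar -xzf "$archive" -C "$tmp" || { rm -rf "$tmp"; return 1; }
    # Hash every regular file, keep only the hashes (not the paths),
    # sort them, then hash the sorted list.
    find "$tmp" -type f -exec md5sum {} + | cut -d ' ' -f 1 | sort | md5sum | cut -d ' ' -f 1
    rm -rf "$tmp"
}
```

Two archives holding the same files in a different order should then print the same value, e.g. `[ "$(tgz_fingerprint a.tgz)" = "$(tgz_fingerprint b.tgz)" ]`, even though `md5sum a.tgz b.tgz` differs.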

In all, I think there are the following equivalent partitions:

You can approximately confirm the duplication by looking at the file sizes provided by S3:

aws s3 --no-sign-request ls --recursive --summarize s3://nanopore-human-wgs/chm13/nanopore/fast5 | sort -gk 3 | cut -d ' ' -f 3,4 | rev | uniq -c -f 1 | sort | rev

mattloose commented 4 years ago

One for @aphillippy or @skoren I think.

skoren commented 4 years ago

Looks like you're correct; there are two issues. Partition 98 was not uploaded correctly, so it was missing data. I've replaced partition 98 with the right version.

The rest of the duplicates are redundant. When we packaged the individual partitions into tgz files, multiple partitions matched the same wildcard and so were packaged into the same files. There shouldn't be any missing fast5 data. I cleaned up the download page to remove the extraneous partitions and renamed the files.
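To illustrate how a wildcard can pull several partitions into one archive, here is a small sketch. The directory naming (`partition_<N>`) and both function names are hypothetical; the actual layout and wildcard used for packaging are not given in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical layout: one directory per partition, named partition_<N>.

# Prints every directory matched by an unanchored prefix glob for partition N.
# For N=1 this also matches partition_10, partition_11, and so on.
loose_match() {
    local root=$1 n=$2 d
    for d in "$root"/partition_"$n"*; do
        if [ -d "$d" ]; then
            basename "$d"
        fi
    done
    return 0
}

# Prints only the directory whose name is exactly partition_<N>.
exact_match() {
    local root=$1 n=$2
    if [ -d "$root/partition_$n" ]; then
        echo "partition_$n"
    fi
    return 0
}
```

Feeding the output of something like `loose_match` to `tar` would produce overlapping archives of the kind described above, while an exact match keeps each partition in its own file.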

How many fast5 files do you end up with in total after extracting all the partitions, once 98 is fixed? There should be about 11 million.
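One way to get that number is sketched below, assuming all partitions were extracted under a single directory and the files carry a `.fast5` extension. The function name `count_fast5` and the example path are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count regular files ending in .fast5 under a directory.
count_fast5() {
    # tr strips the padding that some wc implementations add around the number.
    find "$1" -type f -name '*.fast5' | wc -l | tr -d ' '
}
```

For example, `count_fast5 /path/to/extracted` prints a single integer that can be compared against the expected total.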

tmassingham-ont commented 4 years ago

Thanks. Download in progress, I'll update when I have numbers.

tmassingham-ont commented 4 years ago

Looks good to me now, thank you.