marbl / CHM13

The complete sequence of a human genome

guppy v6 #64

Closed aafshinfard closed 3 weeks ago

aafshinfard commented 1 year ago

Just wanted to ask if there are any plans to release Guppy >= v6 basecalls of the reads? Thanks.

skoren commented 1 year ago

No immediate plans since we're not actively working on CHM13 and we've not found much benefit going to guppy 6+ with our hybrid assembly method.

aafshinfard commented 1 year ago

Thanks for the response @skoren

hasindu2008 commented 1 year ago

Given that I recently downloaded the whole raw signal dataset, I am planning to do a Guppy 6 rebasecall. If it succeeds (I am not sure how long it will take) and if your AWS storage can host more data, @skoren, I can provide the basecalls to be shared.

aafshinfard commented 1 year ago

@hasindu2008 That would be awesome!

hasindu2008 commented 1 year ago

@aafshinfard I have recently converted all the raw data to BLOW5 format and have basecalled it using the Guppy 6.1.3 hac model. Given the large size of the files, I am not sure how I could share them. Any suggestions?

aafshinfard commented 1 year ago

@hasindu2008 Nice to hear you did it. How large are the files?

aafshinfard commented 1 year ago

@hasindu2008 Would be nice if the T2T team can host this (@skoren), but another option would be Zenodo. I heard they support up to 50GB and even more in special cases... https://www.youtube.com/watch?v=S1qK_TA52e4&t=251s

arangrhie commented 1 year ago

@aafshinfard how big is the total file size?

aafshinfard commented 1 year ago

@arangrhie, I opened the issue and @hasindu2008 kindly did the job; waiting for them to respond about the size of the dataset.

hasindu2008 commented 1 year ago

@arangrhie @aafshinfard

The gzipped basecalled FASTQ files are relatively small and I think can be easily hosted:

288G hg2_merged_pass.fastq.gz
39G hg2_merged_fail.fastq.gz

The raw signal data converted to BLOW5 are 3.4 TB. I had to convert the 5 TB+ of compressed FAST5 tarballs to BLOW5; otherwise, base-calling from FAST5 would have taken a few weeks. It would be useful for the future if those BLOW5 files could be hosted, to allow direct base-calling from S3 storage mounted locally as well as partial download of certain genomic regions when necessary (see #63). In my opinion, compressed FAST5 tarballs for a dataset this large are not easily accessible, which diminishes the value of a useful dataset like this.

hasindu2008 commented 1 year ago

@aafshinfard You may download the merged Guppy 6 basecalls for the whole dataset here:

https://slow5test.s3.amazonaws.com/tmp/chm13_merged_pass.fastq.gz
https://slow5test.s3.amazonaws.com/tmp/chm13_merged_fail.fastq.gz

Note that this is not a free S3 storage like the one used for hosting CHM13, so I will be grateful if you can let me know after you download it so that I can delete it then. Otherwise, AWS keeps on charging.
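Before confirming a download of this size, it may be worth checking that the files arrived intact. The following is a minimal sketch and was not posted in this thread: `fastq_stats` is a hypothetical helper name, and it assumes standard four-line FASTQ records and that `gzip` and `awk` are available.

```shell
#!/bin/sh
# Sketch only: fastq_stats is a hypothetical helper, not part of any tool
# mentioned in this thread.
# Prints "<reads> <bases>" for a gzipped FASTQ after testing gzip integrity.
# Assumes 4-line FASTQ records (no wrapped sequence lines).
fastq_stats() {
    # gzip -t exits non-zero if the archive is truncated or corrupt.
    gzip -t "$1" || return 1
    # Line 2 of every 4-line record holds the sequence.
    gzip -dc "$1" | awk 'NR % 4 == 2 { reads++; bases += length($0) }
                         END { printf "%d %d\n", reads, bases }'
}
```

For example, `fastq_stats chm13_merged_pass.fastq.gz` would print the read count and total base count, and a non-zero exit status would suggest an incomplete download. Note that this reads the file twice, which takes a while on files of this size.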

@skoren CHM13 maintainers, feel free to copy these files into your free S3 storage if you think they will be useful to anyone in the future.

Software and versions used for the basecalling are explained below: Nanopore raw signal data were downloaded, extracted and then converted to BLOW5 format using slow5tools. Then, they were basecalled using buttery-eel under Guppy 6.3.7 high accuracy mode. Qscore 7 was used for pass and fail cut-off.
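The conversion step described above could look roughly like the sketch below. It uses the `f2s` and `merge` subcommands of slow5tools; the directory names, the `convert_to_blow5` helper, and the `SLOW5TOOLS` override are illustrative assumptions, and the exact options used for this dataset are not stated in this thread.

```shell
#!/bin/sh
# Sketch only: paths and the helper name are hypothetical.
# SLOW5TOOLS can be overridden, e.g. SLOW5TOOLS="echo slow5tools" for a dry run.
convert_to_blow5() {
    fast5_dir=$1   # directory of extracted FAST5 files
    tmp_dir=$2     # scratch directory for per-file BLOW5 output
    out_blow5=$3   # single merged BLOW5 file
    # Convert each FAST5 file to BLOW5, then merge the results into one file.
    ${SLOW5TOOLS:-slow5tools} f2s "$fast5_dir" -d "$tmp_dir" &&
        ${SLOW5TOOLS:-slow5tools} merge "$tmp_dir" -o "$out_blow5"
}
```

Running it as `SLOW5TOOLS="echo slow5tools" convert_to_blow5 fast5 tmp out.blow5` prints the two commands without executing them, which is a cheap way to check the paths before starting a multi-terabyte conversion.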

Base-calling commands:

# Basecall GridION data
buttery-eel -i min_grid.blow5 --guppy_bin /install/ont-guppy-6.3.7/bin/ --config dna_r9.4.1_450bps_hac.cfg -x cuda:all -q 7 -o reads_min_grid.fastq --port 5555 --use_tcp

# Basecall PromethION data
buttery-eel -i prom.blow5 --guppy_bin /install/ont-guppy-6.3.7/bin/ --config dna_r9.4.1_450bps_hac_prom.cfg -x cuda:all -q 7 -o reads_prom.fastq --port 5556 --use_tcp

aafshinfard commented 1 year ago

@hasindu2008 Awesome, thank you so much!

aafshinfard commented 1 year ago

@hasindu2008 Just started downloading; should be done tonight. Will confirm after it has finished. Thanks again.

aafshinfard commented 1 year ago

@hasindu2008 Just confirming that my download was completed. Thank you so much for your help.

hasindu2008 commented 1 year ago

@aafshinfard No problem, glad to help. If this becomes useful in your work, please consider citing BLOW5, which allowed us to do this basecalling on a very small budget; it would otherwise have cost a fortune.

aafshinfard commented 1 year ago

Sure thing, thank you @hasindu2008

skoren commented 3 weeks ago

Thanks for contributing these, and sorry this dropped off my radar. I have now added a link to the NCBI-hosted files for both.