jts / nanopolish

Signal-level algorithms for MinION data
MIT License

Making a sequencing summary file from Guppy (need Albacore example) #502

pengelgau closed 6 years ago

pengelgau commented 6 years ago

I'm currently trying to index my reads, but it is taking quite a while. (If my estimate holds, it looks like 7 full days of compute time to complete everything.) I think I could speed this up if I converted the sequencing summary file from Guppy into the same format as Albacore's. The only problem is that I don't know what the Albacore files look like, or what format they are in. I've looked around but can't find anything. Below is an example of my Guppy sequencing summary file. It's a txt file with a read id and filename on each line.

76b4dc100267658aa54d86afab31d5da ./13/GXB01136_20180808_FAH87162_GA40000_sequencing_run_A_15756_read_2882_ch_371_strand.fast5
d2b07acc4f7749d54383da68bd0e7a76 ./13/GXB01136_20180808_FAH87162_GA40000_sequencing_run_A_15756_read_2254_ch_253_strand.fast5
2daefd684239f022675c08fe5e272a85 ./13/GXB01136_20180808_FAH87162_GA40000_sequencing_run_A_15756_read_3523_ch_109_strand.fast5
4ed9dddae72e40333d51a0a196c1a05c ./13/GXB01136_20180808_FAH87162_GA40000_sequencing_run_A_15756_read_3515_ch_163_strand.fast5
f3109f57eb28a29e6a736ea3270ade34 ./13/GXB01136_20180808_FAH87162_GA40000_sequencing_run_A_15756_read_3393_ch_502_strand.fast5
431751ffc02c0e0462606ff3fdb1e5ae ./13/GXB01136_20180808_FAH87162_GA40000_sequencing_run_A_15756_read_2191_ch_18_strand.fast5
72e3815d6e493d19e921e536a058af6d ./13/GXB01136_20180808_FAH87162_GA40000_sequencing_run_A_15756_read_3078_ch_506_strand.fast5
baf372d8c39e3cbb79fe8238f0820543 ./13/GXB01136_20180808_FAH87162_GA40000_sequencing_run_A_15756_read_3157_ch_472_strand.fast5

I think I can just use sed to convert it but without an example I can't really try. Any help is greatly appreciated. Thanks for reading.

jts commented 6 years ago

Hi,

What version of guppy are you using? I just looked at a sequencing summary file for one of our recent runs and the format is very similar to albacore's.

Jared

pengelgau commented 6 years ago

Hi Jared,

I used Guppy 1.4.3. I tried using the Guppy file anyway and got the missing filename column error:

Could not find filename column in the header of ../../raw_reads/reads/md5.txt

I then tried adding a header with read_id and filename (in that order, to match the columns in the file), and I still get the same error. This file seems to be space delimited. Is the Albacore one tab delimited?

Phil

jts commented 6 years ago

It looks like you are using this file: ../../raw_reads/reads/albacore_md5.txt as a sequencing summary. I don't think that is the correct file to use. Do you not have files with the name sequencing_summary_nnnn.txt?

pengelgau commented 6 years ago

I can't find a file with that name. I didn't perform the sequencing or basecalling myself; they were done by my school's genomics core. I'll ask them about that file, as they may have neglected to send it to me. In the meantime, I found that md5.txt file alongside my raw reads after unzipping. Does it not have enough information to reformat into something that nanopolish would accept?

pengelgau commented 6 years ago

Actually I do have those. I should have looked just a little bit harder before responding... They are in this format:

filename read_id run_id channel start_time duration num_events template_start num_events_template template_duration sequence_length_template mean_qscore_template strand_score_template
GXB01136_20180817_FAH87054_GA40000_sequencing_run_A_45079_read_6449_ch_327_strand.fast5 9f769fcd-5b23-4479-890c-23b68fbfaa9b f944a0a3b76c9e80f9301ab9f8eb4ed4c31b7971 327 8401.147461 16.5905 13272 8401.285156 13162 16.452999 5858 10.747499 -0.000313
GXB01136_20180817_FAH87054_GA40000_sequencing_run_A_45079_read_4439_ch_200_strand.fast5 7f6e317a-1497-4e37-9019-84d9b012a50a f944a0a3b76c9e80f9301ab9f8eb4ed4c31b7971 200 8412.035156 5.829 4663 8412.21875 4516 5.64525 1821 11.852244 -9.4e-05

I will give these a try and get back to you.

jts commented 6 years ago

Yes, those are the files you need.
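
For readers unsure whether a given file can serve as a sequencing summary, here is a minimal Python sketch that inspects the header line. It assumes the summary is expected to be tab-delimited with `filename` and `read_id` columns. Only the `filename` requirement is confirmed by the error message quoted above; the `read_id` requirement and the tab delimiter are assumptions based on the headers shown in this thread. The function name `check_summary_header` is hypothetical and is not part of nanopolish.

```python
def check_summary_header(path, required=("filename", "read_id")):
    """Report the apparent delimiter and any required columns missing from the header.

    Assumption: the first line of the file is the header, and the required
    column names are exactly those listed in `required`.
    """
    with open(path) as handle:
        header = handle.readline().rstrip("\r\n")

    # Guess the delimiter: a tab if one is present, otherwise a single space.
    delimiter = "\t" if "\t" in header else " "
    columns = header.split(delimiter)

    # Collect the required names that do not appear as exact column names.
    missing = [name for name in required if name not in columns]
    return delimiter, missing
```

A file like the md5.txt listing above would come back as space-delimited with both columns missing, since its first line is a data row and not a header.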

pengelgau commented 6 years ago

The files worked just fine. I would suggest that the help output for index mention Guppy as well as Albacore. Nonetheless, thanks for the quick help, I greatly appreciate it.

jts commented 6 years ago

Glad to hear it! I'll make a note about mentioning guppy works too.

BCArg commented 5 years ago

I am having the same issue, namely

Could not find filename column in the header of /nexusb/Gridion/20190905MicroRap/Microbio/20190905_1400_GA10000_FAK80986_effcd777/sequencing_summary/GXB01439_20190905_160028_FAK80986_gridion_sequencing_run_Microbio_sequencing_summary.txt

I have sequenced a pool of samples with Gridion (basecaller should be Guppy, don't know exactly which version). I have demultiplexed the samples with qcat and now I want to create the index to link the (demultiplexed) fastq with the fast5 files. After calling the command:

nanopolish index -v -s /nexusb/Gridion/20190905MicroRap/Microbio/20190905_1400_GA10000_FAK80986_effcd777/sequencing_summary/GXB01439_20190905_160028_FAK80986_gridion_sequencing_run_Microbio_sequencing_summary.txt -d /nexusb/Gridion/20190905MicroRap/Microbio/20190905_1400_GA10000_FAK80986_effcd777/fast5_pass/ /nexusb/Gridion/20190905MicroRap/Microbio/20190905_1400_GA10000_FAK80986_effcd777/fastq_pass/demux/BORD1725_barcode03.fastq

I get the error message shown above. In fact there is no filename column in my sequencing_summary file. Below I am displaying its header with one entry:

filename_fastq	filename_fast5	read_id	run_id	channel	mux	start_time	duration	num_events	passes_filtering	template_start	num_events_template	template_duration	sequence_length_template	mean_qscore_template	strand_score_template	median_template	mad_template	pore_type	experiment_id	sample_id
FAK80986_d83ffac69ab548d4fc4f9876b6d2f931ed3827e2_0.fastq	FAK80986_d83ffac69ab548d4fc4f9876b6d2f931ed3827e2_0.fast5	d14889bc-7e81-47ea-8c12-8aa8055fd2f1	d83ffac69ab548d4fc4f9876b6d2f931ed3827e2	408	1	8.512250	0.674250	0	TRUE	8.531750	0	0.654750	232	12.107406	0.000000	87.202446	9.349191	not_set	20190905MicroRap	Microbio

I noticed that the structure of the file above is somewhat different from that of the sequencing_summary.txt file generated by Albacore.

I have installed nanopolish with conda, version 0.11.2 (nanopolish 0.11.2 h705302d_0 bioconda)

Is there any fix for this (other than indexing without the -s option, which appears to be very slow)?

cdlawless commented 4 years ago

Hi @BCArg ,

I ran into the same problem with the output from epi2me. Essentially the file contains the right info, just in a slightly different format.

Here's a quick (and dirty) R script to reformat the summary file: https://www.dropbox.com/s/p9nia675pek1roj/reformat_summary.R?dl=0

Use on the command line as: Rscript reformat_summary.R summaryfile reformattedsummaryfile

Best,

Craig
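
As an illustration of the kind of reformatting discussed above, here is a minimal Python sketch. It does not reproduce the linked R script. It assumes that the summary is tab-delimited and that the only change needed is renaming the `filename_fast5` header column to `filename`, which is the column named in the error message. That assumption has not been verified against nanopolish 0.11.2, and the function name `reformat_summary` is hypothetical.

```python
def reformat_summary(src_path, dst_path,
                     old_name="filename_fast5", new_name="filename"):
    """Copy a tab-delimited summary file, renaming one header column.

    Assumption: renaming `old_name` to `new_name` in the header is all that
    is required; data rows are copied through unchanged.
    """
    with open(src_path) as src, open(dst_path, "w") as dst:
        header = src.readline().rstrip("\r\n").split("\t")

        # Leave the header alone if the target column already exists.
        if new_name not in header:
            if old_name not in header:
                raise ValueError(
                    f"neither {new_name!r} nor {old_name!r} found in header")
            header[header.index(old_name)] = new_name

        dst.write("\t".join(header) + "\n")

        # Stream the remaining rows without loading the whole file.
        for line in src:
            dst.write(line)
```

Calling `reformat_summary("summaryfile", "reformattedsummaryfile")` would write the renamed copy, which could then be passed to `nanopolish index` with the `-s` option used earlier in this thread.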