jts / nanopolish

Signal-level algorithms for MinION data
MIT License
567 stars 159 forks source link

Squiggle Read assertion with albacore 1.0.3 #137

Closed zephyris closed 7 years ago

zephyris commented 7 years ago

Nanopolish variants throws an assert when using locally called 2D data using albacore 1.0.3: nanopolish: src/nanopolish_squiggle_read.cpp:283: void SquiggleRead::_load_R9(uint32_t, const string&, const std::vector&, const std::vector&, uint32_t): Assertion `fields.size() == 4' failed. Is this a known issue with locally called files? Or have I missed a required step (e.g. somehow adding events) for using locally called files?

jts commented 7 years ago

This is probably an issue with locally called files. Could you email me a single fast5 file from your run?

damientully commented 7 years ago

Hi Jared,

I am using nanopolish v 0.6.1 and am getting the same error with locally basecalled reads using albacore v 1.1.0. Do you have a solution to this? Do you want some fast5 files?

Thanks, Damien

jts commented 7 years ago

@damientully can you try the brand-new change that @mateidavid just committed, which allows you to extract basecalls for a specific version of albacore?

On Thu, May 11, 2017 at 3:29 PM, damientully notifications@github.com wrote:

Hi Jared,

I am using nanopolish v 0.6.1 and am getting the same error with locally basecalled reads using albacore v 1.1.0. Do you have a solution to this? Do you want some fast5 files?

Thanks, Damien

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/jts/nanopolish/issues/137#issuecomment-300892943, or mute the thread https://github.com/notifications/unsubscribe-auth/AAXxn05pgtd6NQ4leL0PH5LpUAaLxk1sks5r42F8gaJpZM4NV-ZD .

jts commented 7 years ago

@zephyris when you say locally basecalled do you mean MinKNOW or albacore?

jts commented 7 years ago

I've just pushed a fix to github, can you try it?

zephyris commented 7 years ago

Brilliant, I'll pull it and try it. Here are some reads if it's still of use: https://www.dropbox.com/s/dpnpgal2lynr67w/albacore1.0.3.zip?dl=0 One folder is reads originally called via metrichor recalled with albacore. The other is reads only called by albacore.

damientully commented 7 years ago

@jts Seems to be running now with no errors. Thanks Jared

jts commented 7 years ago

Great, thanks. Note that at the end of the run nanopolish will now report how many reads were unusable due to parsing errors. You might want to keep an eye on this as you may lose some coverage. This should be fixed in an upcoming albacore release.

zephyris commented 7 years ago

Looks good, thank you very much.

zephyris commented 7 years ago

Doing some more checks it looks like nanopolish extract doesn't see 2D calls from albacore 1.0.3 and just extracts the template.

jts commented 7 years ago

I'll look into this (I think this is the same issue as discussed here https://github.com/jts/nanopolish/issues/96)

jts commented 7 years ago

Could you send me an example read @zephyris?

zephyris commented 7 years ago

Here's some reads: https://www.dropbox.com/s/dpnpgal2lynr67w/albacore1.0.3.zip?dl=0 All three in the albacore1.0.3 folder show this precise issue. It seems likely the same issue as #96

For reference, here's a comparison of sequence extraction with nanopolish and poretools: ` $ nanopolish extract gull27_20170322_FNFAF14058_MN18485_sequencing_run_20170322_SmOxP9_15800Bright_81808_ch2_read6204_strand.fast5 | grep '>'

519dffd8-8dcb-4f82-a31a-f4d02a0b70e8_Basecall_1D_template:1D_001:template gull27_20170322_FNFAF14058_MN18485_sequencing_run_20170322_SmOxP9_15800Bright_81808_ch2_read6204_strand gull27_20170322_FNFAF14058_MN18485_sequencing_run_20170322_SmOxP9_15800Bright_81808_ch2_read6204_strand.fast5 [extract] found 1 files, extracted 1 reads $ poretools fasta gull27_20170322_FNFAF14058_MN18485_sequencing_run_20170322_SmOxP9_15800Bright_81808_ch2_read6204_strand.fast5 | grep '>' 519dffd8-8dcb-4f82-a31a-f4d02a0b70e8_Basecall_2D gull27_20170322_FNFAF14058_MN18485_sequencing_run_20170322_SmOxP9_15800Bright_81808_ch2_read6204_strand gull27_20170322_FNFAF14058_MN18485_sequencing_run_20170322_SmOxP9_15800Bright_81808_ch2_read6204_strand.fast5 channel_2_519dffd8-8dcb-4f82-a31a-f4d02a0b70e8_complement gull27_20170322_FNFAF14058_MN18485_sequencing_run_20170322_SmOxP9_15800Bright_81808_ch2_read6204_strand.fast5 channel_2_519dffd8-8dcb-4f82-a31a-f4d02a0b70e8_template gull27_20170322_FNFAF14058_MN18485_sequencing_run_20170322_SmOxP9_15800Bright_81808_ch2_read6204_strand.fast5 `

jts commented 7 years ago

Thanks, I will look tomorrow.

jts commented 7 years ago

I think I fixed this in 4d012ca, could you try it?

zephyris commented 7 years ago

Works perfectly on a handful of test reads, I'll run a full polish using it now and see how it goes. Thanks!

zephyris commented 7 years ago

The sequence extraction works fine, but now I'm getting the squiggle read errors again: 'nanopolish: src/nanopolish_squiggle_read.cpp:293: void SquiggleRead::_load_R9(uint32_t, const string&, const std::vector&, const std::vector&, uint32_t): Assertion `fields.size() == 4' failed.'

johnomics commented 7 years ago

Nanopolish is rejecting roughly 60% of a read set called with Albacore 1.1.0, based on the new warnings, eg: warning: nanopolish could not parse 192 reads out of 289

I assume this is the same problem as above and will require an Albacore fix, but please could you document what the problem is? Why does it only affect a subset of reads? Is there anything that can be done to work around it?

I'm using a subset of high quality reads (Q>=16, read length>10kb), so could add in other reads to make up for the loss - but how much coverage is enough? At what point does accuracy plateau?

I ran nanopolish extract with -t template -b albacore:1.1.0.

jts commented 7 years ago

For some fast5s the k-mers annotated in the event table do not match the k-mers in the basecalled read. Nanopolish relies on this information to figure out which event(s) correspond to each read k-mer. In some cases it is possible to skip the erroneous k-mers to recover the read but in the worst case the entire read is skipped.

Unfortunately this can't be fixed in Albacore so I'm going to have to come up with a better way to mitigate this issue as losing 60% of the reads is clearly unacceptable. I'll try to work on this soon but it may take a little while. In the meantime if you have excess coverage I suggest trying to get 50-100X after the rejection of reads.

johnomics commented 7 years ago

Thank you very much - OK, I'll retry with all the reads.

jts commented 7 years ago

@johnomics do you think I could get a tar.gz of those 289 reads? It would make a good test case

zephyris commented 7 years ago

Just to say; still getting the squiggle assert with 2D Albacore 1.0.3 calls.

Looks like the issue is only with 2d called reads. Extracting just the template and complement calls gives just the ~50% failure rate described above.

jts commented 7 years ago

Ok, thanks Richard. I'm working on the template/complement issue first as its more pressing - I'll look at 2D afterwards.

zephyris commented 7 years ago

Completely understand the priorities, I just confirmed the exact issue and wanted to let you know.

johnomics commented 7 years ago

@jts I just emailed you about a test set of reads - please let me know if you don't receive it. Sorry it took a while to get to this.

jts commented 7 years ago

Hi all,

I've implemented a new parsing method (with help from ONT) that should workaround this issue in most cases. It isn't very well tested yet so I've put it to a new branch: https://github.com/jts/nanopolish/tree/albacore_move_workaround. Give it a try and let me know how it looks!

Jared

alelim-bio commented 7 years ago

Hello Jared,

We have received the same error as indicated in this thread using locally basecalled 2D reads from albacore v. 1.1.0 . I would also like to mention that we are using barcoded reads from the LSK-208 native barcoding kit. We have attempted to implement your albacore work around as well as nanopolish version 0.61. However, we are still unsuccessful in getting pass this error. Would it help if we sent some .fast5 files?

Kind Regards,

Alex

jts commented 7 years ago

Yes, could you please send me some .fast5s?

alelim-bio commented 7 years ago

Hello Jared,

Attached to the link are some of are .fast5 files. Each of these are from their respective barcode. Thank you for your help. https://drive.google.com/drive/folders/0Bx0yLMWhTP5VQkRScWt0d05scUk

Kind Regards,

Alex

jts commented 7 years ago

Thanks Alex could you upload the reference genome to that google drive too?

alelim-bio commented 7 years ago

Hello Jared,

I have uploaded the reference assemblies.

Best,

Alex

jts commented 7 years ago

Thanks Alex. I've been able to debug the issue and it appears that albacore encodes the 2D Alignment table slightly differently than metrichor did. As 2D data is being replaced by 1D^2 I'm reluctant to modify nanopolish to handle both encodings - I suggest you input your reads into nanopolish as 1D sequences using (nanopolish extract --type template) for now.

zephyris commented 7 years ago

Thanks for this update, the branch works well for me too. It's a shame not to be able to use the 2D data, but the 1D polish certainly works OK.

zephyris commented 7 years ago

I've re-run my complete polish, however polishing gives not change to the sequence; no variants are called on any contig window... The stdout log lists just the "#CHROM\tPOS\t"... header. The [post-run summary] indicates the reads could not be calibrated. Is this a linked issue, or something else?

jts commented 7 years ago

You should only get a few calibration errors, certainly not all reads should fail calibration. Is this the same dataset that you posted on dropbox above?

zephyris commented 7 years ago

Yup; I get the same issue for the dataset of reads called with metrichor then recalled with albacore and the reads only called by albacore.

jts commented 7 years ago

Richard, could you try basecalling with very recently released albacore 1.2.1? It fixes some nanopolish compatibility issues and I hope it will take care of your problem. Note that you'll have to instruct nanopolish extract to explicitly look for the 1.2.1 calls (-b albacore:1.2.1).

zephyris commented 7 years ago

No problem. I'm running the recall now. Will obviously take a bit of time but I'll report back ASAP. Thanks again for looking into this!

alelim-bio commented 7 years ago

Hello Jared,

After following your instructions the extraction did complete. However, after complete polishing we also ended up with the same problem that zephryis ran into. All of our reads after a complete polish could not be calibrated. Should we also look into re-basecalling with the new albacore version 1.2.1?

Kind Regards,

Alex

jts commented 7 years ago

I suspect it is the same issue. You may wish to hear from @zephyris about whether the problem was fixed before re-basecalling though.

Jared

zephyris commented 7 years ago

It looks like the recalling with albacore 1.2.1 hasn't helped. I've only tried it on a subset of reads which have been called by metrichor, then recalled by albacore 1.2.1 though. I'll keep you posted about reads which have not previously been called.

jts commented 7 years ago

Ah, I'm sorry to hear that. Please send reads when you can and I'll take a closer look (I'll be away until June 21 though so may not be very quick to respond)

zephyris commented 7 years ago

https://www.dropbox.com/s/f9x1joth232nssj/albacore1.2.1.zip?dl=0 Here you go. It's the same three metrichor then albacore recalled reads I posted before, recalled on windows with albacore 1.2.1. Have a good trip!

jts commented 7 years ago

@zephyris I have fixed this issue in the branch recalibration-fix. I need to test it on other datasets before merging it into master but it should work for your dataset.

kirstyn commented 7 years ago

@jts I have been having a similar problem with 1D reads basecalled on albacore (tried versions 1.1.0 and 1.2.1). I've also tried the recalibration-fix, which improves things but I am having issues getting nanopolish variants to process the entire genome window. If I ask for -w to cover my full genomw length (~12kb) it just can't manage, but asking for 1kb at a time works.

jts commented 7 years ago

Hi @kirstyn, what do you mean by it can't manage the full genome length? Does it run out of memory?

kirstyn commented 7 years ago

@jts I don't know, the process just seems to freeze and then terminate

jts commented 7 years ago

It probably ran out of memory and is an unrelated issue. If the genome is only 12kb you may have excessively deep coverage. Maybe try downsampling to 100-200X?

kirstyn commented 7 years ago

Ok, it is downsampled already... but yes still must be a memory issue, I'll work on it! My main issue was the read calibration, which seems to be ok with the recalibration-fix so thanks for that!

zephyris commented 7 years ago

@jts The recalibration-fix branch worked perfectly, ran and polished with no issues. Thank you!