jts / nanopolish

Signal-level algorithms for MinION data
MIT License
568 stars 159 forks source link

Error when trying to variant call using an R10 model #1059

Closed hasindu2008 closed 1 year ago

hasindu2008 commented 1 year ago

Hi Jared,

I have trained a basic k-mer model for R10.4.1 using your R10 branch in nanopolish, using the recent R10.4.1 base models provided by ONT. I am trying to evaluate some variant calls using that, but I am getting the following error

nanopolish variants -r reads.fastq -b reads.bam -g ~/scratch/hg38noAlt/hg38noAlt.fa -p2 --models-fofn=a.txt
Error: unknown model:

The contents of a.txt looks like:

../r10-models/r10.4.1_400bps.nucleotide.9mer.model

The head of the ../r10-models/r10.4.1_400bps.nucleotide.9mer.model is like:

#model_name     r9.4_450bps.nucleotide.6mer.template.model
#kit    r9.4_450bps
#k      9
#strand template
#alphabet       nucleotide
AAAAAAAAA       54.20490641     3.0     1.0     1.0
AAAAAAAAC       58.59079727     3.0     1.0     1.0
AAAAAAAAG       55.952068       3.0     1.0     1.0
AAAAAAAAT       58.43352009     3.0     1.0     1.0
AAAAAAACA       63.65995928     3.0     1.0     1.0

I initially had the following but changed the name to R9.4 to see if the name was the issue.

#model_name     r9.10.1_400bps.nucleotide.9mer.template.model
#kit    r10.4.1_40bps

Is there a way to test a new R10 9-mer model in nanopolish?

I manage to do an event alignment using f5c eventalign using this model and most reads successfully aligned. As I am not sure of the best way to check the accuracy of signal alignments, I thought perhaps using the nanopolish variant caller could be a solution. Your thoughts are welcome.

jts commented 1 year ago

It would appear that nanopolish failed to parse the header for some reason. What you pasted looks fine, could you provide the model to me so I can look into this more closely?

hasindu2008 commented 1 year ago

Hi Jared, Thanks for the quick response. The file is available here https://raw.githubusercontent.com/hasindu2008/f5c/r10/test/r10-models/r10.4.1_400bps.nucleotide.9mer.model

hasindu2008 commented 1 year ago

Also, when training for a large human dataset using the r10 branch, the nanopolish train seems to crash after some time:

/usr/bin/time -v nanopolish  train -r PGXX22563.pass.fastq -g hg38noAlt.fa -b PGXX22563_pass.bam  -t 64 --train-kmers=all --input-model-filename ../r10-models/r10.4.1_400bps.nucleotide.9mer.model -d ../pcr/
[train] initialized r10_450bps for alphabet nucleotide for 9-mers
[train] round 0
nanopolish: src/nanopolish_train.cpp:515: void add_aligned_events_for_read(const ReadDB&, const faidx_t*, std::vector<StateSummary>&, const string&, const string&, size_t, std::unordered_map<unsigned int, int>&, const std::vector<BedMethylRecord>&, const bam_hdr_t*, const bam1_t*, size_t, int, int): Assertion `training_kit == sr.get_model_kit_name(strand_idx)' failed.

Any idea how this can be fixed? Now I am trying to extract chromosome 1 and try running only on it.

hasindu2008 commented 1 year ago

The, training for chr1 did not crash, but unfortunately none of the k-mers have been trained.

[train] initialized r10_450bps for alphabet nucleotide for 9-mers
[train] round 0
[bam process] processed 566251 bam records in 590.00s (960 records/s). Latest: chr1:248882296
[train] trained 0/262144 k-mers in 0.00s
[train] round 1
[bam process] processed 566251 bam records in 589.00s (961 records/s). Latest: chr1:248882296
[train] trained 0/262144 k-mers in 1.00s
[train] round 2
[bam process] processed 566251 bam records in 587.00s (965 records/s). Latest: chr1:248882296
[train] trained 0/262144 k-mers in 1.00s
[train] round 3
[bam process] processed 566251 bam records in 589.00s (961 records/s). Latest: chr1:248882296
[train] trained 0/262144 k-mers in 0.00s
[train] round 4
[bam process] processed 566251 bam records in 589.00s (961 records/s). Latest: chr1:248882296
[train] trained 0/262144 k-mers in 0.00s

Could be some parameters? Incase you want to reproduce, all the necessary files are here as a tarball. To reproduce:

tar xf nanopolish-train.tar 
cd nanopolish-train/PGXX22563_pcr/
../slow5-nanopolish-train/nanopolish  train -r PGXX22563.pass.fastq -g hg38noAlt.fa -b PGXX22563_pass.bam  -t 64 --train-kmers=all --input-model-filename ../r10-models/r10.4.1_400bps.nucleotide.9mer.model -d ../pcr/

This uses the SLOW5 integrated nanopolish r10 branch so we can do the training fast when signal files are residing on NFS and @hiruna72 has separately sent a PR. Note: The training worked for R9.4.1 data when existing nanopolish R9.4 were used as base models (though the accuracy got worse because the existing pore-model for R9.4 was anyway good)

jts commented 1 year ago

So, going back to the original issue when I run with your model I get this error:

Error, unrecognized model kit r10.4.1_400bps

This is because the metadata in nanopolish hasn't been set up for R10.4.1:

https://github.com/jts/nanopolish/blob/r10/src/pore_model/nanopolish_model_names.cpp#L17

If I change the model kit to be r10_450bps the model loads correctly but all the reads are discarded, likely because the signal calibration fails. This is probably why no training happens on chr1. I will try to look into why calibration fails.

hasindu2008 commented 1 year ago

Hi Jared,

After renaming as you said, the pore model seems to load. I added some debug prints and the alignment step seem to fail.

However, when I run f5c event align with the same model, most reads align and pass calibration (see attached if they look reasonable). Is the adaptive banded event alignment algorithm in the r10 branch different to the one that used to be in Nanopolish main branch (that I used 3 years ago when writing f5c)? I briefly looked at the r10 code and they look considerably different.

summary.tsv.gz events.tsv.gz

jts commented 1 year ago

The r10 branch training method re-uses the adaptive banded alignment code, which I made more generic to support this use. I thought it was functionally equivalent but I will look into it.

jts commented 1 year ago

I've tried your f5c r10 branch on some local R10.4.1 data I generated but no reads aligned:

[meth_main] total entries: 4238, qc fail: 0, could not calibrate: 0, no alignment: 4238, bad fast5: 0

Could you provide me with the small set of reads that you used to generate eventalign.tsv and summary.tsv?

hasindu2008 commented 1 year ago

Hi Jared,

You can download a little dataset from here. Commandline should like like:

./f5c eventalign -b test/hg2_lsk114_reads_1000/PGXX22394_reads_1000_6.4.2_sup.bam -r test/hg2_lsk114_reads_1000/PGXX22394_reads_1000_6.4.2_sup.fastq -g hg38noAlt.fa --slow5 test/hg2_lsk114_reads_1000/PGXX22394_reads_1000.blow5 --kmer-model test/r10-models/r10.4.1_400bps.nucleotide.9mer.model

The output is like:

[meth_main] total entries: 1076, qc fail: 2, could not calibrate: 0, no alignment: 156, bad fast5: 0
hasindu2008 commented 1 year ago

@jts Could you also provide me with a little portion of your dataset? Would like to investigate what the difference is?

jts commented 1 year ago

Ah, f5c does work for me - I was providing the model incorrectly before.

jts commented 1 year ago

I needed to make two changes in https://github.com/jts/nanopolish/commit/5b6ac67ba94496164d2514b9a89fe4db4fc307de to make the reads load:

  1. the R10 pore detection was failing due to the new flowcell type
  2. the event detection parameters that I used for R10.0 do not seem to work for R10.4.1, so I have gone back to the r9 ed parameters for now (which I assume f5c is using)
jts commented 1 year ago

Note that the r10 branch is now far out of date with master. Rebasing it seems messy so I will likely re-implement R10.4.1 support from scratch

hasindu2008 commented 1 year ago

Cool. Now it seems to actually do training some k-mers when I tested with chr1. Now let me run with the whole genome. For a PCR'ed dataset, does the following command look right to you? nanopolish train -r PGXX22563.pass.fastq -g hg38noAlt.fa -b PGXX22563_pass.bam -t 64 --train-kmers=all --input-model-filename ../r10-models/r10.4.1_400bps.nucleotide.9mer.model -d ../kmers/

hasindu2008 commented 1 year ago

Hi @jts

After 5 training rounds, roc plots for snps look like this.

image

Do you think the accuracy has saturated and would it be beneficial to go for more rounds? Or any idea how we could improve the accuracy further - for instance parameters and thresholds?