Closed hasindu2008 closed 1 year ago
It would appear that nanopolish failed to parse the header for some reason. What you pasted looks fine, could you provide the model to me so I can look into this more closely?
Hi Jared, Thanks for the quick response. The file is available here https://raw.githubusercontent.com/hasindu2008/f5c/r10/test/r10-models/r10.4.1_400bps.nucleotide.9mer.model
Also, when training for a large human dataset using the r10 branch, the nanopolish train seems to crash after some time:
/usr/bin/time -v nanopolish train -r PGXX22563.pass.fastq -g hg38noAlt.fa -b PGXX22563_pass.bam -t 64 --train-kmers=all --input-model-filename ../r10-models/r10.4.1_400bps.nucleotide.9mer.model -d ../pcr/
[train] initialized r10_450bps for alphabet nucleotide for 9-mers
[train] round 0
nanopolish: src/nanopolish_train.cpp:515: void add_aligned_events_for_read(const ReadDB&, const faidx_t*, std::vector<StateSummary>&, const string&, const string&, size_t, std::unordered_map<unsigned int, int>&, const std::vector<BedMethylRecord>&, const bam_hdr_t*, const bam1_t*, size_t, int, int): Assertion `training_kit == sr.get_model_kit_name(strand_idx)' failed.
Any idea how this can be fixed? Now I am trying to extract chromosome 1 and try running only on it.
The, training for chr1 did not crash, but unfortunately none of the k-mers have been trained.
[train] initialized r10_450bps for alphabet nucleotide for 9-mers
[train] round 0
[bam process] processed 566251 bam records in 590.00s (960 records/s). Latest: chr1:248882296
[train] trained 0/262144 k-mers in 0.00s
[train] round 1
[bam process] processed 566251 bam records in 589.00s (961 records/s). Latest: chr1:248882296
[train] trained 0/262144 k-mers in 1.00s
[train] round 2
[bam process] processed 566251 bam records in 587.00s (965 records/s). Latest: chr1:248882296
[train] trained 0/262144 k-mers in 1.00s
[train] round 3
[bam process] processed 566251 bam records in 589.00s (961 records/s). Latest: chr1:248882296
[train] trained 0/262144 k-mers in 0.00s
[train] round 4
[bam process] processed 566251 bam records in 589.00s (961 records/s). Latest: chr1:248882296
[train] trained 0/262144 k-mers in 0.00s
Could be some parameters? Incase you want to reproduce, all the necessary files are here as a tarball. To reproduce:
tar xf nanopolish-train.tar
cd nanopolish-train/PGXX22563_pcr/
../slow5-nanopolish-train/nanopolish train -r PGXX22563.pass.fastq -g hg38noAlt.fa -b PGXX22563_pass.bam -t 64 --train-kmers=all --input-model-filename ../r10-models/r10.4.1_400bps.nucleotide.9mer.model -d ../pcr/
This uses the SLOW5 integrated nanopolish r10 branch so we can do the training fast when signal files are residing on NFS and @hiruna72 has separately sent a PR. Note: The training worked for R9.4.1 data when existing nanopolish R9.4 were used as base models (though the accuracy got worse because the existing pore-model for R9.4 was anyway good)
So, going back to the original issue when I run with your model I get this error:
Error, unrecognized model kit r10.4.1_400bps
This is because the metadata in nanopolish hasn't been set up for R10.4.1:
https://github.com/jts/nanopolish/blob/r10/src/pore_model/nanopolish_model_names.cpp#L17
If I change the model kit to be r10_450bps
the model loads correctly but all the reads are discarded, likely because the signal calibration fails. This is probably why no training happens on chr1. I will try to look into why calibration fails.
Hi Jared,
After renaming as you said, the pore model seems to load. I added some debug prints and the alignment step seem to fail.
However, when I run f5c event align with the same model, most reads align and pass calibration (see attached if they look reasonable). Is the adaptive banded event alignment algorithm in the r10 branch different to the one that used to be in Nanopolish main branch (that I used 3 years ago when writing f5c)? I briefly looked at the r10 code and they look considerably different.
The r10 branch training method re-uses the adaptive banded alignment code, which I made more generic to support this use. I thought it was functionally equivalent but I will look into it.
I've tried your f5c r10 branch on some local R10.4.1 data I generated but no reads aligned:
[meth_main] total entries: 4238, qc fail: 0, could not calibrate: 0, no alignment: 4238, bad fast5: 0
Could you provide me with the small set of reads that you used to generate eventalign.tsv and summary.tsv?
Hi Jared,
You can download a little dataset from here. Commandline should like like:
./f5c eventalign -b test/hg2_lsk114_reads_1000/PGXX22394_reads_1000_6.4.2_sup.bam -r test/hg2_lsk114_reads_1000/PGXX22394_reads_1000_6.4.2_sup.fastq -g hg38noAlt.fa --slow5 test/hg2_lsk114_reads_1000/PGXX22394_reads_1000.blow5 --kmer-model test/r10-models/r10.4.1_400bps.nucleotide.9mer.model
The output is like:
[meth_main] total entries: 1076, qc fail: 2, could not calibrate: 0, no alignment: 156, bad fast5: 0
@jts Could you also provide me with a little portion of your dataset? Would like to investigate what the difference is?
Ah, f5c does work for me - I was providing the model incorrectly before.
I needed to make two changes in https://github.com/jts/nanopolish/commit/5b6ac67ba94496164d2514b9a89fe4db4fc307de to make the reads load:
Note that the r10 branch is now far out of date with master. Rebasing it seems messy so I will likely re-implement R10.4.1 support from scratch
Cool. Now it seems to actually do training some k-mers when I tested with chr1. Now let me run with the whole genome. For a PCR'ed dataset, does the following command look right to you?
nanopolish train -r PGXX22563.pass.fastq -g hg38noAlt.fa -b PGXX22563_pass.bam -t 64 --train-kmers=all --input-model-filename ../r10-models/r10.4.1_400bps.nucleotide.9mer.model -d ../kmers/
Hi @jts
After 5 training rounds, roc plots for snps look like this.
Do you think the accuracy has saturated and would it be beneficial to go for more rounds? Or any idea how we could improve the accuracy further - for instance parameters and thresholds?
Hi Jared,
I have trained a basic k-mer model for R10.4.1 using your R10 branch in nanopolish, using the recent R10.4.1 base models provided by ONT. I am trying to evaluate some variant calls using that, but I am getting the following error
The contents of a.txt looks like:
The head of the ../r10-models/r10.4.1_400bps.nucleotide.9mer.model is like:
I initially had the following but changed the name to R9.4 to see if the name was the issue.
Is there a way to test a new R10 9-mer model in nanopolish?
I manage to do an event alignment using f5c eventalign using this model and most reads successfully aligned. As I am not sure of the best way to check the accuracy of signal alignments, I thought perhaps using the nanopolish variant caller could be a solution. Your thoughts are welcome.