Modifications in DNA samples

Stautis commented 3 years ago

Hi,

Let me preface the question by saying this isn't so much a feature request as it is a query with regards to existing feature capability. I am aware that Nanocompore was designed for the purpose of running a pairwise comparison analysis on RNA samples, but I have been experimenting with trying to quantify and visualize differences between a modified and an unmodified DNA sample. Thus far, I have been getting errors with regards to kmer correspondence (between .tsv input and reference), but I'm wondering if there is any particular reason as to why this sort of pairwise signal comparison shouldn't work for DNA data as well?

Thanks, Thomas

tleonardi commented 3 years ago

Hi @Stautis, this is something certainly feasible although we haven't tested it. In principle, the workflow on DNA should work without much change to nanocompore. If you could post the error that you get I'd be happy to take a look.

Stautis commented 3 years ago

Hey @tleonardi, Thanks for such a swift reply! So the error that I'm seeing is:

nanocompore.common.NanocomporeError: The result database is empty

due to:

Reading eventalign index files References found in index: 1 Filtering out references with low coverage References remaining after reference coverage filtering: 0

I made the two .tsv files I'm comparing from alignments of identical 3' segments of two different constructs. I double checked that the contents of the .tsv files are identical to one another and that they correspond to the reference I'm providing when running SampComp (which is the same I'm using to produce the alignments)

I've attached a one-read sample of the collapsed eventalign .tsv from the two constructs I'm hoping to compare, to illustrate the correspondance between indices in both .tsv files, as well as the reference used (as .txt files).

Collapsed eventalign files: out_eventalign_collapse_ex1.tsv.txt out_eventalign_collapse_ex2.tsv.txt

Reference: seg_ref_w_A.fasta.txt

And for completeness this is the script I used:

#!/bin/sh

nanocompore sampcomp \
        --file_list1 /export/valenfs/data/processed_data/MinION/20201016_max_DNA/second_analysis/16_5_mer_collapse/construct1/A_in_reference/out_eventalign_collapse.tsv \
        --file_list2 /export/valenfs/data/processed_data/MinION/20201016_max_DNA/second_analysis/16_5_mer_collapse/construct2/out_eventalign_collapse.tsv \
        --label1 construct1_w_A_ref \
        --label2 construct2 \
        --fasta /export/valenfs/data/processed_data/MinION/20201016_max_DNA/second_analysis/17_5_mer_sampcomp/seg_ref_w_A.fasta  \
        --outpath /export/valenfs/data/processed_data/MinION/20201016_max_DNA/second_analysis/17_5_mer_sampcomp/results/ \
        --min_coverage 1 \
        --overwrite \
        --nthreads 100 \
        --pvalue_thr 0.01 \
        --comparison_methods GMM,KS,TT,MW \
        --logit

Any help would be greatly appreciated!

tleonardi commented 3 years ago

hi @Stautis sorry for the slow reply. Could you try running Nanocompore again (using the last version) and passing the option --log_level debug. You will then get a log file, can you post it here?

Stautis commented 3 years ago

hey @tleonardi

No problem. I updated to the 1.0.2 version of nanocompore, ran my script again with the option you mentioned and got the following log file.

out_sampcomp.log

tleonardi commented 3 years ago

Hi @Stautis, I've looked at your logfile. The main problem is that you have low coverage. A lot of reads are discarded by Nanocompore because they have a high % of invalid kmers (invalid according to Nanopolish, in the sense that the could not be resquiggled properly). After discarding these reads, the coverage is too low and the reference sequence is discarded.

You can try relaxing these two filtering parameters:

  --max_invalid_kmers_freq MAX_INVALID_KMERS_FREQ
                        Max fequency of invalid kmers (default: 0.1)
  --min_coverage MIN_COVERAGE
                        Minimum coverage required in each condition to do the comparison (default: 30)

I think the reason why you have such a high % of invalid kmers is that your reference sequence is pretty short (69nt). Short sequences are often problematic with Nanopore and on top of that even a single modification with a strong signal effect might cause a resquiggling issue for more than 7 kmers, thus causing the % of invalid kmers to go above the default of 10%. You could start by trying --max_invalid_kmers_freq 0.2 and see what you get.

Also, I've noticed that you have used an old version of NanopolishComp. I'd suggest that you upgrade to the latest because we have fixed a bug in the calculation of the fraction of invalid kmers. Since Nanocompore v1.0.3, NanopolishComp is included in the main Nanocompore package. You just need to use nanocompore eventalign_collapse. Let me know if you still have problems!

tleonardi commented 3 years ago

Closing due to inactivity. Will reopen if we get more feedback

tleonardi / nanocompore

Modifications in DNA samples #166