questions about the '--downsample_high_coverage' option

tleonardi / nanocompore

RNA modifications detection from Nanopore dRNA-Seq data

https://nanocompore.rna.rocks

GNU General Public License v3.0

80 stars 12 forks source link

questions about the '--downsample_high_coverage' option #125

Closed lingolingolin closed 4 years ago

lingolingolin commented 4 years ago

Hi @a-slide @tleonardi ,

Thanks a lot for developing nanocompore. I have a question related to its 'downsample_high_coverage' option. Is down-sampling done at each reference position/site independent from each other? If this is the case, then it means different reads' events would be used for the analyses, right?

Thanks in advance and I look forward to hearing from you.

a-slide commented 4 years ago

Hi @lingolingolin,

Nope it is done at read level in the Whitelist module.

Cheers Ad

lingolingolin commented 4 years ago

thanks @a-slide , do you know how I can trace back which specific reads have been selected?

lingolingolin commented 4 years ago

Hi @a-slide, i have another relevant question: I always ran into error: nanocompore.common.NanocomporeError: The result database is empty unless I switched on the --downsample_high_coverage option. I also tried using picard's DownsampleSam to sample mapped reads and re-run nanocompre accordingly. but it always gave me the empty database error. I doubt this could be due to uneven or not enough coverage of the transcript. so I tried Picard downsampling different amount of reads. I checked the collapsed events file and make sure that coverage in both samples involved in the comparison have common sites with enough coverage. but I had no luck to make nanocompore work.
Do you have any hint what could be the potential reason? The command I used is below: nanocompore sampcomp -1 control.downsample.50pcnt.events.collapse.tsv -2 treatment.events.collapse.tsv -f ref.fasta -o nanocompore_samcomp_context4 --pvalue_thr 1 --logit --nthreads 3 --sequence_context 4 --comparison_methods GMM,KS,MW,TT --overwrite

a-slide commented 4 years ago

Hi @lingolingolin, Not directly unfortunately, but it can be done if you run Nanocompore via the API. In that case you can first instantiate a version of Whitelist manually and then get the list of selected reads via the ref_reads attribute. Then you can pass the whitelist object to SampComp.

Alternatively it is also possible to retrieve the read ids in the database generated by SampComp, but it is very complicated as the database is indexed by reference id and positions. In addition the feature is only implemented in the branch readid_tracking at the moment

The database is a Python3 shelve that can be open as follow to get the list of read ids

import shelve

db_fn = "Path_to_db"
with shelve.open(db_fn, flag='r') as db:
    # iterate to get all the read_ids
    db[refid][pos]["data"][cond_lab][sample_lab]["reads"]

I will raise a feature request issue

lingolingolin commented 4 years ago

thanks a lot @a-slide . I will have a try with both approaches.