Problems in the output coordinates

icebert / eccDNA_RCA_nanopore

eccDNA identification from nanopore long reads of rolling-circle amplicon

MIT License

5 stars 4 forks source link

Problems in the output coordinates #5

Open WeijiaSu opened 2 years ago

WeijiaSu commented 2 years ago

Hello, I was trying to run the pipeline with my eccDNA data. I got some confusing results. My reference (>MLV) is 8857bp, however, in the output, some fragments are longer than the reference.

For example: f75c8bb0-38ca-4579-b1e6-a45a51d4d258 4 1 2053 735 MLV:8123-10175(+) 684c27d0-97bf-4900-b2a0-9ca1991625e4 2 1 2823 1362 MLV:7496-10318(+)

I also got some results with coordinates <0

For example: de744270-bb0d-40a9-9763-a3e37f0aabb4 4 2 2789 140 MLV:3034-3173(-)|MLV:-1115-1533(+)

Any suggestions would be appreciated.

Thanks Weijia

icebert commented 2 years ago

Hi, can you add the --verbose option and save the output log and send me? Then I can debug. Thanks.

WeijiaSu commented 2 years ago

Hi, Here is the log file with the --verbose option. Looks like a lot of entries have coordinates > 8857. The reference MLV is a retroelement, it has repetitive sequences LTR, which means the first ~600 bp and the last ~600 bp are the same. I am not sure if this caused the problem. I've also noticed this problem with other LTR transposon sequences. Thanks for your help.

eccRCA.log

icebert commented 2 years ago

Hi,

This is caused by big gaps between two sub-fragments in one Nanopore read. Reads with big gap between two sub-fragments are unreliable because sub-fragments from the RCA should be continuous. You can just filter out these reads with coordinates greater than the length of reference.

In addition, cautions should be taken when identifying eccDNA from highly repetitive sequences since RCA is used. For example, for sequence AGCT-AGCT-AGCT-AGCT, it's impossible to discriminate whether the original sequence is AGCT-AGCT and undergoes two round RCA, or the original sequence is AGCT and undergoes four round RCA.

WeijiaSu commented 2 years ago

Hi, Thanks for the reply. I filtered the reads with coordinates greater than the reference length and smaller than 0. I am wondering if there is a way to filter out all of the reads with sub-fragments with big gaps. I found some other reads do not make much sense even though they have reasonable coordinates, I thought those may also be caused by the gaps. Thanks for your help.

icebert commented 2 years ago

Sorry we do not have an option to filter out reads with large gaps. But one way to check whether the reads have large gaps is to comment out line 329 to 331. The threadFix deals with the gaps. When disabling this, the gaps will leave there and you can check the mapping of reads in the log with --verbose .

https://github.com/icebert/eccDNA_RCA_nanopore/blob/8b51b38a53fb1c110e28901b53b839d5495a573b/eccDNA_RCA_nanopore.py#L329-L331