ncbi / fcs

Foreign Contamination Screening caller scripts and documentation

Are endogenous elements removed by FCS? #41

Closed BenjaminGuinet closed 12 months ago

BenjaminGuinet commented 1 year ago

Dear Foreign Contamination Screen developers,

When submitting genome assemblies to GenBank, I was wondering whether FCS treats endogenous viral elements (EVEs) as contaminating sequences to be removed from the assembly.

Thank you very much for your time.

All the best,

Benjamin

etvedte commented 1 year ago

Hi Benjamin,

When running FCS-GX on eukaryote assemblies, sequences identified as entirely virus are given the GX action EXCLUDE and are removed from the assembly, as there is no supporting evidence that they are endogenous elements. All other cases are ignored by the GX automated cleaning steps, including identified viral chimeras in eukaryote genomes and any viral element in a prokaryote genome.
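As a rough illustration of the rule described above (this is a sketch, not the actual GX implementation), the decision can be written as a small function. The names `Call`, `auto_clean_action` and the label `NO_AUTO_ACTION` are hypothetical and exist only for this example; only the `EXCLUDE` action comes from the comment above.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """Hypothetical summary of one screened sequence (not a real GX structure)."""
    host_is_eukaryote: bool   # the assembly being screened is a eukaryote
    hit_is_virus: bool        # the sequence was assigned to a virus
    entirely_virus: bool      # the whole sequence looks viral (not a chimera)


def auto_clean_action(call: Call) -> str:
    """Mirror the rule stated in the thread.

    Only sequences identified as entirely virus in a eukaryote assembly
    are auto-removed; viral chimeras in eukaryotes and any viral element
    in a prokaryote assembly are left alone by automated cleaning.
    """
    if call.host_is_eukaryote and call.hit_is_virus and call.entirely_virus:
        return "EXCLUDE"
    # Hypothetical label meaning "automated cleaning does nothing here".
    return "NO_AUTO_ACTION"
```

Under this sketch, a contig like the one discussed later in this thread (a eukaryote assembly, a sequence called as entirely virus) would map to `EXCLUDE`, while a viral segment flanked by host sequence would not.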

If you identify a case where you believe GX is removing a valid EVE, please let us know.

Eric

BenjaminGuinet commented 1 year ago

Dear Eric, in my experience FCS-GX doesn't seem to delete any of the EVEs I've previously annotated, which I think is a good sign.

However, I looked at one specific genome and saw that a chuviral sequence was deleted by FCS-GX, for example on this contig:

ATAAAAATAAATATATTATATATAATTCCATTCACTGTATAATGATGACGTATTTCGAATTTCCGATTCA
TAACGCCTTGTGAATCGGCCACAGTAACAAATTTTCATTCCGATTCAAATAACTGATGTCCTATTCAATA
AGTGATGCAAATAATCAGAGGCAATCTGCCTTTGCCTTTTAAGAATGCGGATTCACTAGAGCTTCCGACC
GATTCAAAAAATTTGAATAGGTTTTTATTTCCTGGGCTTTGTGTTCACTCGTAAGTGGATGAGATCCATA
TCTTTTCTGTAGCGATTGGTGATGTACTGAAAAGACATCATAAAATTCCTGCATTATACCTGTACACGCA
TCTTTAAAGAGAGATCTCTCTGATAATTATCAAGTTTAATCCCTGAATCGATAAATTTCTTGACGGATAC
TGATATAATCATTTCAGAGACCTGACCAACAGATGTTGCTGAAATCTTGTGAACATCTATATATTTTAGT
TGTCTTGTGTCACATCTGATTGGCTTCTCTTTGATTGGTCTCATACAATAATTACATTCTCCTGTAACAG
CCCACACAATTTTATCTACGTTGAAATTCTAACTGTTGGAATATTTGATTTGAATAATTTGCCATACAGT
ATAACATGTGACAACTCGGTCACGAGAATTAATTGTTTTAGAATAAGGGGTGGGCGCGTCTCGGACGGGA
AGCACACTTACCGAAATTCGACTGGAAGCTCGGCGTCTAGGAGTCAAGCAAACGGCTCAGACGAGCAAAG
TAACCTTCAAGAGCAGACTTAACACCCTAAGGCAGTCGCGTGGTTGGCAGGATCGCCGGCAGCGATCGGG
CGACCTCTTAACGCGTTTCTAGGACTAAATGGCGCGATTATTTAAATCGCGCTAAATGCGTTGAAGGGGA
ATCGAGATTGGAGAAAATATTAACTAACCTGAAAATAAAAAATTTGATTGGGGTCGTCACAAACATAAAA
TATGTAAAAAGTTAACTTTATAATGTTGTGTTGATGCCTTCATATGTTTATGTGTGTTTGATGATCCTAT
CACCTGTTGATAAACATTACTCAATGTGTTAGACGCTATCGACTCAACAAAATTAGGAGCTCATTTCTTC
TGAAGAATCTGCCCTAAACAATTCTTTTATATCCTCATTCTTTGTCGTAGATCTCAAAGCTGTAATTACT
GACTTCCTTAAATAAGTTGATGGAAGCATTGGCCTATCTATAGGCAAAGAATAGGAATCTTTAAATAATC
CTGTATAATCTGTAATTTCTTGGTCAGGTATACATAGAAATTGTTCCATGTATGAATAAATTTCATAATT
CTTTTTCATTGAAAACTTCATTAGACCTATGAAAGGACTCAATAGGTCCGACTCTGCTCGAACAAACATG
TTGTGTAGATAGATTATAGGAAACCCTCCTACTATACTTGATGTCAATAGGAGAGCTGTAATTTGCTTAT
TAGTTAGATCCTTGTATTTGGAATCCTTGAATTTAGAAAAGAGTCCAGAATAATGCTACCATACAGCAAG
GAGTATATGTAGAAGTCACTTTACAAGCACTGTGTGCATTGCTATATGCAGATGCTATATACTCATCAAG
AGTGTTAAGAAAAGCATTGTTTGCTCCATGACATTTTTGAATTTTCCTAAATCCTTGTGGTAACTATCGT
ATTCCATGATGCAGTTTTGGAAAAGGACAAATATGTTTCTGTTCCATAACTGTCCTCTACTTTAATTTTG
TGTCCAAATTCCTTTAATTCACTAGATAGAGTATTCACTATTTAATCCTTAAAATTCGTCATATCCAGTT
TGTTGATGAGCAGAGCTTACCATTACAATAATTACTCTCACATCATCTCCTTTACGCAATATGTAATATG
GTAATCCAAGTTTGCCCATAGCAACATTAACTTGTCCTAAGTAGATTATTACCCATGTATCTTGGTTTAA
CCCATTTATTCCACCCGCTTGTCCATCCCACTAGTAACATTTTCGTTCATCAGGAACATAGAACATTGTT
TGTTCATATGCAAGATGTGTTTTACTGAACATACTTGTCTTGAATACTCTATCTAAGGTTTCCTTCATAA
CTCCTTTAACTGTTAATCCTCTAAATCTATTGTTCCATCCAGAAGCATCAATATTGATGTTTAGTTGTCG
GAATCCACGATAAACTTTCTTTAAATTTCTTAATGCATATAGTTTCTTAAGAAGAGCAAGCTCTCCCAAA
GTCATCGCTTGTTCGTCAGAATACTCGTCAAGAAATCTCATAGCATTTTTCTCTTGCACAAGACATCTTG
TCCTGTCTTCGTAAGTCTTACAACCAAATCCTCGACAGCTTACTTTAAGCTCTTTCTCTTTCGGGACAAT
TTTAATGGTTAGATAATCCAATAAATCTGTCAAATCCTCACTTGACTCATATTTGTCAATATATTCTACA
TAGTGATGAAAAATCTTATGGTTCAGGAGATAAACTAACAAAAGTCTAGTTTCTTTCCATGATGTTTCGG
TATATTCACTTTCTCTGTTTATAATATCAATATGATGGCTGGGTCATATGGATCTGTATTTTTCCAATAT
GCATCCTTGATTGCTTGATGAACATGCTGATCTGGTGGAGCTATTGGAGGCCATCTTTTCTTTCATAATA
TATTATTCCAACACCGTGAGCTTGAAACCATTTTAACCCAAAGGACTTTGCACCTGGGATTTTATTTTTT
ATTTTGATTCTAGGGTGTAATCAGATGATTTTGCAACAAAAATTTAAATAATCGCCGGAGCCCTTTTTGA
GAAACATCGATTTTAATTTTTAAAATAACGAAAATTCAATTTTGCTTTAAAAATTTCGATTTTTGAAAAA
AGTTTTAGGTATATGCAAAATTTATAGATATTAAAAAAAAGGAATATTTTTTATATAAACCATTTTTTTA
TACCATCAATAATTTCCAAGATGAACGAAAAAACTGACATTTTACGAGTTTTTAAAAGAATTTAACATAA
AAAATTTTTTTAAATTGTTTAATTCGGAGTGACTCTATCAAGCGAAGTATTTCGCTATTGTTTAGCGCGT
CGGTGACCCAATTCTCCTCAATACCAAACAATGCCTTTGCAATCACTATCCACAATCGCTCACTTTGAAG
TGGATAGCGACAATAAAGTGCGTGGTACGTCTTAAAAGCTATTTCAAACGCACGTAATGGCGATTCTACT
AGGTACGGTGTCCCATCTATCACAAAATAATTGCTCGTAATCGTTGATTTCGTTGTTCCAACAACAAGTG
GGAATGGCTGCAGTGTTGTGCCTCTCTGTCCTGCCGTATTTTTCATTCCTTAAATCGTCTTTTCAAAATC
ACCTGAAATTTTCTTTTTGAATTTAATTATACAGTGAGAAAATATGATCCCAGACGTGCAAAGACCTTTG
AGTTAAATTGGTTTCAAGCTCACCGTGCCGAATGTAATTAATGATTGAGTAATTTTTACCCTCATTCACT
TTAACCATAATTATTTTGAGAGGTTCTGTTGTCTTTGTAAATAGGTCGACACTGCCTTGCTTCATATTAA
TAAATTGATGTCCTGCAATTTTACTTAAGCATGACAATTCATGTCTCATTGGCATTGATACTGATGTAAG
GATCTCTTTGATTCTACAGATTGATTGTAATCAAAATCCAGTTTTGTTAATAATTCATCTTCTAAATTCT
TTAAAAATGTGTCATTGACCCATGTTTCAGATTCTTTTAATGATTCTCCAATATTTAATCCTTCCAAGTT
CTTGAGAATTGTGTATGTCTCATTTTTGTGAATTACAAAAAGTTTACATATCTCTTTGATAAAATCACAA
GATGTGTCAAAAGCATTGACATCCAATGCACTCCTTTCAGCATACATTGAATATATCAATGTTGATAACA
AGTCCATGATTTTGTTATACATCATTACTACATATGGCCGAGGTAATAAATATGTACTTTCTTTATAAGT
TACATGATCGATTGTAAACAAATTTCATATCCAGTCCTTTCATGTGATAACAACACTTTGCTTGAATTGC
TCTCATTTGATTATCTGTTAACATGTCATTAACTTTATATCCCGACAAGTGTGAAATTTTCTCTATAAGT
GTTTCAAATTCACTAGCTATATAAATTAGATTCAATATCTCTGGTTTGAGAGATGTATTATCAAACAAGT
CAGTTCTTACTACCTGAGCAGCTTTTGATGATTCTAAAAAATCCCTCTCATTCATATCCCAGTAAAAAAA
AAGAATTGGGAGTGCCGAAAAAGAATTGGGAGTTAGCTCAAGCTCTATTCTGTTTGGCATGAAAACAGAA
TTGGGACGTTACCGTGAATTCTTTTCACCTGAATCATTTGAATTGACGTTTGTTGATTCGAGTTTGCATT
GTAAATGAAA

However, as the sequencing was carried out using an Illumina technique, it is unlikely that this contig comes from a free-living RNA contaminant. Furthermore, we know that endogenous viral elements tend to accumulate in regions of genomes with a high density of transposable elements, which could make assembly in these regions more difficult and so favour the appearance of small fragments around EVEs. Although we cannot be certain that these sequences are EVEs, finding a small contig with only one hit to an RNA virus is, I think, compatible with an EVE rather than a contaminant.

Ben

etvedte commented 1 year ago

Is this the complete contig sequence? Can you post the row(s) for the action report (fcs_gx_report.txt) and the taxonomy report (taxonomy.rpt) for this contig?

BenjaminGuinet commented 1 year ago

Yes, sure.

It is the complete contig, and here is the row from the report:

Exclude:
Sequence name, length, apparent source
scaffold18160   4420    virs:viruses

etvedte commented 1 year ago

I want to see what species taxids in the GX database are being hit and whether or not repeats were identified, as you mentioned this in an earlier comment. For that, I would like to see the corresponding row(s) in the gx report and taxonomy report output for this sequence.

The fcs_gx_report.txt format should be eight columns, matching the description at https://github.com/ncbi/fcs/wiki/FCS-GX#fcs-gx-report-output

The taxonomy.rpt format should be 34 columns, matching the description at https://github.com/ncbi/fcs/wiki/FCS-GX-taxonomy-report
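For anyone who has both files locally, one way to pull out the rows for a single sequence is sketched below. It assumes that the reports are tab-separated, that header or metadata lines start with `#`, and that the first column holds the sequence name exactly as it appears in the assembly; check these assumptions against the wiki pages linked above. The function name `rows_for_sequence` is made up for this example.

```python
from typing import Iterable, List


def rows_for_sequence(lines: Iterable[str], seq_id: str) -> List[List[str]]:
    """Return the data rows whose first column equals seq_id.

    Assumptions (not confirmed by this thread): tab-separated columns,
    '#'-prefixed header lines, sequence name in the first column.
    """
    rows = []
    for line in lines:
        # Skip blank lines and header/metadata lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\r\n").split("\t")
        if fields[0] == seq_id:
            rows.append(fields)
    return rows
```

The function takes any iterable of lines, so it can be called with an open file handle, for example `rows_for_sequence(open("fcs_gx_report.txt"), "scaffold18160")`, and then again with `taxonomy.rpt`.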

BenjaminGuinet commented 1 year ago

Where can I see these files? I only got the following report file from my NCBI submission, the fixed foreign contaminations report: FixedForeignContaminations_final_assembly.txt

murphyte commented 1 year ago

Is your submission SUB13515533? I think I've found your files. We're still working on the public reporting through the submission system, so the file that's posted to the portal is a bit sparse at the moment. If you'd like, I could post the full report here or e-mail it to you if you provide an address.

For scaffold18160, it is a partial coverage hit (see RID-8KE12RWS013). That's definitely virus, but also quite distinct from anything known, which makes it harder to interpret the lack of flanking coverage (distant cross-species hits are predominantly on better conserved CDS regions, so we don't require high coverage). If the sequence were longer, it would either have fallen below coverage thresholds or been more likely to pick up some insect hits on either side if it is a true integrant, in which case it wouldn't have been called. Your point about RNA viruses is a good one, but it primarily applies to environmental contamination. We see a lot of evidence of contamination that likely arises somewhere in the laboratory (including downstream on the sequencing machines), which creates opportunities to pick up RNA-sourced contaminants. So the bottom line is that nothing is simple. We'll do some broader review of potential RNA virus hits to explore the area further.

The hits that weren't auto-cleaned appear to be lower scoring and could mostly stay. Mito and vector hits should be addressed.
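To make the coverage argument concrete, here is a small arithmetic sketch. It is only an illustration: the real GX coverage calculation and thresholds are not given in this thread, the aligned length of 1500 bp is invented, and only the 4420 bp contig length comes from the report row posted above. The helper `coverage_fraction` is hypothetical.

```python
def coverage_fraction(aligned_bases: int, seq_len: int) -> float:
    """Fraction of a sequence covered by a hit (simple ratio, an assumption)."""
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    if not 0 <= aligned_bases <= seq_len:
        raise ValueError("aligned_bases must be between 0 and seq_len")
    return aligned_bases / seq_len


# Invented aligned length, for illustration only.
HYPOTHETICAL_ALIGNED_BASES = 1500

# 4420 is the reported length of scaffold18160; the longer lengths are made up
# to show how the same viral hit covers a shrinking share of a longer sequence.
for seq_len in (4420, 20_000, 100_000):
    frac = coverage_fraction(HYPOTHETICAL_ALIGNED_BASES, seq_len)
    print(f"length {seq_len}: viral hit covers {frac:.1%} of the sequence")
```

The point is simply that the same viral hit covers a shrinking fraction of the sequence as the contig gets longer, so a short contig can pass a coverage threshold that a longer one carrying the same insert would not.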

Are there other calls you're interested in?

BenjaminGuinet commented 1 year ago

Yes, it was that one. And yes, if you can send it, that would be great: Benjamin.guinet95@gmail.com

I fully understand that this is not an easy question to answer and that it's not always easy to come to a conclusion. Then there is the trade-off between systematically keeping sequences, even suspicious ones, so as not to lose too many genuine elements, and, on the contrary, being careful not to include too many false positives.

Thanks for all your explanations, it's much clearer for me now.

Have a nice day.

Benjamin

murphyte commented 1 year ago

FYI, our submitted paper is available on bioRxiv: https://biorxiv.org/cgi/content/short/2023.06.02.543519v1

I'll send you the report later today.