Closed reslp closed 1 year ago
Hi Philipp,
You can examine these results more closely to determine the appropriate action.
1) When you run GX, there are a number of messages printed to the screen as logs including an aggregate coverage value Aggregate coverage:
just above the contamination summary. Can you post your results for that here? We more often see these ultra low-coverage contaminant cases in genomes with lower aggregate coverage.
2) You can grep
for the suspect contigs in the supplementary output file *taxonomy.rpt
. You can find the specs for that file here:
https://github.com/ncbi/fcs/wiki/FCS-GX-taxonomy-report
The low-coverage cases are most likely the consequence of short chimeric contaminant spans embedded within sequence that is ambiguous to GX (inconclusive/repeat/low-coverage). This idea is supported if your aggregate coverage from 1) is low. In *taxonomy.rpt
you should be able to find the sequence ranges labeled contaminant
and can do BLAST searches for validation. If you want to post a couple of example rows from the *taxonomy.rpt
on this thread I can try to weigh in.
3) I would also inspect the *taxonomy.rpt
file and do BLAST searches for sequences assigned EXCLUDE anml:fishes
with modest fish coverage. We have seen fish genomes generating noise in the results, particularly in repeat-laden regions. The assigned repeats info is in column 2 of *taxonomy.rpt
.
Just based on what you shared, my hunch is that the sequences marked zero coverage should be retained, not trimmed or excluded. It's possible that after steps 1-3 you conclude that the primate calls are the only high-confidence contaminants in your sample.
Hi,
many thanks for your fast reply and your explanations! I am looking forward to hear you opinions on this:
ad 1) I looked into the log file. The coverage is 7% and the program operates in low-coverage mode. This is directly from the log file:
* * * Asserted tax-div, anml:worms, conflicts with highest-coverage one, anml:insects. * * *
primary-divs: ['anml:worms'] (25%)
Top represented divs:
anml:insects 223043 bp
anml:worms 146862 bp
anml:fishes 28818 bp
* * * Warning: Fraction of primary-div = 25%, below 66% * * *
Aggregate coverage: 7%
Low-coverage mode. The following contaminant divs are out of scope
anml:insects : len:5304; div-cvg:49%
--------------------------------------------------------------------
fcs_gx_report.txt contamination summary:
----------------------------------------
seqs bases
----- ----------
TOTAL 36 832413
----- ----- ----------
prok:bacteria 1 310203
fung:budding yeasts 1 200499
fung:fungi 1 143096
prok:CFB group bacteria 1 118771
anml:fishes 23 44938
anml:primates 9 14906
--------------------------------------------------------------------
ad 2)
Thank you for pointing me to the explanation of the *.rpt
file. I have missed this completely. As an example I picked one of the contigs that was flagged EXCLUDE (lcl|scaffold593_cov139
).
From what I can see the whole contig is flagged initially, although only for the second part the final decision is contamination. Here are the corresponding lines from the report file:
cat Gblic_cleaned.fcs.adaptor.147586.taxonomy.rpt | grep "lcl|scaffold593_cov139"
lcl|scaffold593_cov139~~1..47195 47195 509,0,0,0 3419 | Candidatus Methanoperedens nitroreducens 1392998 arch:archaea 143 143 18 | 103695 anml:reptiles 230 142 16 | 6526 anml:molluscs 299 165 15 | 3469 plnt:plants 303 144 14 | 0 inconclusive none 0
lcl|scaffold593_cov139~~47196..49264 2069 0,0,0,0 1073 | Torulaspora microellipsoides 1136883 fung:budding yeasts 1045 883 56 | 226126 fung:budding yeasts 1045 888 54 | 329046 fung:chytrids 825 825 53 | 3218 plnt:mosses 530 530 46 |3 contaminant fung:budding yeasts 51
lcl|scaffold593_cov139~~49265..200499 151235 1593,58,0,0 10912 | Harmonia axyridis 115357 anml:insects 1976 866 49 | 214514anml:rodents 313 289 34 | 1043004 fung:ascomycetes 803 250 34 | 2940366 fung:ascomycetes 803 269 32 |0 low-coverage anml:insects 1
I am now wondering if these three lines are the reason the whole contig is flagged as EXCLUDE, because the three hits span the whole contig length (pos 1 to 200499) even though for lines 1 and 3 the decisions are inconclusive and low-coverage. Am I misinterpreting things here?
ad 3)
I still need to blast the flagged contaminations to double check. However, since (aggregate) coverage seems to be important and you mentioned modest coverage for fish sequences I am wondering what would be a sensible coverage cut off to decide what I should keep. What is the relationship between the aggregated coverage and the contaminations listed in the report? For example, would it make sense to decide based on aggregated coverage and ignore all contaminations in the report that are below 7%?
The whole rationale behind this is that I want to make sure my assemblies are clean before I perform complicated and time-consuming downstream analyses like functional annotations, phylogenomics, comparisons of gene content etc. and before I submit them to NCBI. Having additional contaminations flagged after I performed many analyses could potentially lead me to having to rerun many steps and would consume (lots of) additional time. Ideally I would like to have my assemblies clean before all of that so that they pass all NCBI check once I submit them. I have many genomes to analyze, this is why I am looking for an automated solution that mirrors the checks that NCBI is doing as closely as possible when genomes get submitted.
Many thanks again for your help!
kind regards,
Philipp
Hi Philipp,
Thanks for the additional info.
Your low aggregate coverage value (7%) may help to explain your results and observations here. This metric represents how well the input genome overlaps with sequences in the GX database. Low values indicate that it is a novel genome relative to the database and there is low overall overlap between the input genome and the correct taxonomic division, in this case, worms. So we have to be a little more wary of potential noise.
You are spot on with the example you posted. In this case only a 2 kbp span was flagged as a potential contaminant. Let me know what the BLAST results for lcl|scaffold593_cov139~~47196..49264 looks like. It doesn't appear to be repetitive. But since your aggregate coverage is low, any inconclusive or low-coverage rows should probably be interpreted as belonging to the source organism (i.e. not contaminant). So the two potential scenarios I see here is this is a valid small fungal contaminant and should be excised internally, or retain the whole sequence.
For the fish sequences, I would just try doing BLAST searches on a bunch (at least 10) and see what you find. There are a couple with 100% coverage, so see what those look like. Try both blastn and blastx. If the aggregate coverage of your genome is low but the contaminant coverage of the assigned contaminants is high, like the case with primates, we are generally more confident of the removal of these sequences even in novel genomes. But when the contaminant coverage is low, it's more suspect. Setting a threshold value is unlikely to be helpful...you are better off making decisions on a taxonomic division level, e.g. remove all primate calls, ignore all fish calls.
Thank you for pointing out your concern here...this is a major need for NCBI submitters that we hope to address with the FCS tool suite. If you remove all the sequences from your genome reported here, it should pass NCBI checks. But we also want to make sure that you aren't removing unnecessary sequences from the genome, and if you do some manual checking now you can have results ready to respond to NCBI in the event the sequences get flagged.
Thank your for your explaination. I now understand how to interpret the aggregated coverage percentage. It does indeed make sense. There are not many flatworm genomes publicly available yet.
I have extracted the sequence mentioned above ran BLASTX (against nr) and BLASTN (against nt). These are the top 5 results I got:
For BLASTX:
RID: R9D19ZMX013
Job Title:scaffold593_cov139_47196_49264
Program: BLASTX
Database: nr All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects
Query #1: scaffold593_cov139_47196_49264 Query ID: lcl|Query_95855 Length: 2069
Sequences producing significant alignments:
Scientific Common Max Total Query E Per. Acc.
Description Name Name Taxid Score Score cover Value Ident Len Accession
unnamed protein product [Sparganum proliferum] Sparganum pr... NA 64606 478 478 43% 3e-160 74.50 398 VZH98552.1
unnamed protein product [Spirometra erinaceieuropaei] Spirometra e... NA 99802 477 477 43% 5e-160 74.50 396 VZI14686.1
unnamed protein product [Schistocephalus solidus] Schistocepha... NA 70667 471 471 43% 2e-158 73.83 336 VDL95269.1
unnamed protein product [Taenia asiatica] Taenia asiatica NA 60517 471 471 43% 1e-157 72.82 391 VDK24578.1
ribosomal protein l4 [Echinococcus multilocularis] Echinococcus... NA 6211 472 472 43% 4e-157 72.48 454 CDS40993.1
For BLASTN:
RID: R9CZMS8Y013
Job Title:scaffold593_cov139_47196_49264
Program: BLASTN
Database: nt Nucleotide collection (nt)
Query #1: scaffold593_cov139_47196_49264 Query ID: lcl|Query_6211 Length: 2069
Sequences producing significant alignments:
Scientific Common Max Total Query E Per. Acc.
Description Name Name Taxid Score Score cover Value Ident Len Accession
PREDICTED: Stegodyphus dumicola 60S ribosomal protein L4-A-lik... Stegodyphus ... NA 202533 174 174 28% 5e-38 72.43 1406 XM_035374257.1
Fusarium proliferatum ET1 probable ribosomal protein RPL4A... Fusarium pro... NA 1227346 171 171 10% 6e-37 81.22 1059 XM_031229915.1
Fusarium mangiferae putative ribosomal protein RPL4A... Fusarium man... NA 192010 171 171 10% 6e-37 81.22 1059 XM_041829102.1
Fusarium fujikuroi IMI 58289 probable ribosomal protein RPL4A... Fusarium fuj... NA 1279085 165 165 10% 3e-35 80.75 1059 XM_023574414.1
Morchella importuna uncharacterized protein (LAJ45_06824),... Morchella im... NA 1174673 165 165 10% 3e-35 80.66 1128 XM_046116385.1
While the blastn results do indicate fungal hits, e-value, percent identity and query coverage are low. For blastx instead all five best hits are from proteins of flatworms. This makes more sense. Given the information I have now, I would not exclude the sequence.
I still have to look into the flagged fish sequences. I do think that these are likely real contaminations, as the genomes is from a parasite living on fishes. The assembly has already be prefiltered using a blobology approach, which also incorporates blast assigned sequence identity information.
Thank you also for your thoughts on a cut-off threshold. I will keep thinking about possible solutions when processing many genomes. Of course there is also the possibility to run blast on all flagged contaminations to have an additional source of information.
blastn results do indicate fungal hits, e-value, percent identity and query coverage are low
would be good when you are reviewing to look at the start/end positions of hits to see if these blastn/blastx results are overlapping or non-overlapping. The query coverage is low but the fungal percent ID is rather high.
I do think that these are likely real contaminations, as the genomes is from a parasite living on fishes
This is also good information. Biologically these contaminants make sense.
Closing. Please follow-up if you have other questions.
Hi,
Thank you very much for providing the FCS tool. I have been running some tests with FCS-adaptor and FCS-GX. The programs run fine, but I do have some questions regarding the output of FCS-GX. I am asking this question because I plan to automatically filter assemblies based on the output of FCS-GX. The genome I analyzed is from Gyrodactylus bliccensis, a flatworm. The output of FCS-GX looks like this:
There are several contigs flagged as contaminations that are very long, yet the percentage alignment coverage of the contamination is 0. The suggestion of FCS-GX is exclude the contig. Obviously I do not want to exclude the whole contig if the contamination only spans a very small proportion. My question now is: What would be your recommendation on how to handle these cases? Is there a way to identify the exact position of the contamination so I can trim or mask it?
Many thanks already in advance for your help!
kind regards,
Philipp