ncbi / fcs

Foreign Contamination Screening caller scripts and documentation
Other
88 stars 12 forks source link

Question on how to interpret results of FCS-GX #20

Closed reslp closed 1 year ago

reslp commented 1 year ago

Hi,

Thank you very much for providing the FCS tool. I have been running some tests with FCS-adaptor and FCS-GX. The programs run fine, but I do have some questions regarding the output of FCS-GX. I am asking this question because I plan to automatically filter assemblies based on the output of FCS-GX. The genome I analyzed is from Gyrodactylus bliccensis, a flatworm. The output of FCS-GX looks like this:

cat Gblic_cleaned.fcs.adaptor.147586.fcs_gx_report.txt
##[["FCS genome report", 2, 1], {"git-rev": "0.2.3-14-g20bf86e2", "run-date": "Mon Nov 14 09:47:40 2022", "db": {"build-date": "2022-07-09", "seqs": 3086610, "Gbp": 708.351}, "run-info": {"agg-cvg": 0.074, "asserted-div": "anml:worms", "primary-divs": ["anml:worms"]}}]
#seq_id start_pos   end_pos seq_len action  div agg_cont_cov    top_tax_name
lcl|scaffold593_cov139  1   200499  200499  EXCLUDE fung:budding yeasts 0   Torulaspora microellipsoides
lcl|scaffold882_cov90   1   4546    4546    EXCLUDE anml:fishes 21  Esox lucius
lcl|scaffold3105_cov139 1   310203  380645  TRIM    prok:bacteria   0   Caenorhabditis sp. 32 LS-2015
lcl|scaffold3760_cov192 1   1891    1891    EXCLUDE anml:fishes 72  Tachysurus fulvidraco
lcl|scaffold4164_cov59  1   1322    1322    EXCLUDE anml:primates   100 Homo sapiens
lcl|scaffold4616_cov93  1   3540    3540    EXCLUDE anml:primates   92  Homo sapiens
lcl|scaffold5048_cov139 1   118771  118771  EXCLUDE prok:CFB group bacteria 0   Fulvivirga lutimaris
lcl|scaffold6063_cov58  1   2330    2330    EXCLUDE anml:primates   100 Homo sapiens
lcl|scaffold9512_cov118 1   1925    1925    EXCLUDE anml:fishes 25  Hypomesus transpacificus
lcl|scaffold10516_cov96 1   1155    1155    EXCLUDE anml:fishes 31  Paramormyrops kingsleyae
lcl|scaffold11553_cov105    1   1552    1552    EXCLUDE anml:fishes 58  Denticeps clupeoides
lcl|scaffold12100_cov139    1   143096  143096  EXCLUDE fung:fungi  0   Apophysomyces ossiformis
lcl|scaffold12173_cov223    1   1527    1527    EXCLUDE anml:primates   100 Homo sapiens
lcl|scaffold16036_cov53 1   1348    1348    EXCLUDE anml:primates   100 Homo sapiens
lcl|scaffold16326_cov209    1   2773    2773    EXCLUDE anml:fishes 49  Esox lucius
lcl|scaffold17339_cov56 1   1289    1289    EXCLUDE anml:primates   68  Homo sapiens
lcl|scaffold18723_cov113    1   3544    3544    EXCLUDE anml:fishes 41  Tachysurus fulvidraco
lcl|scaffold21072_cov51 1   1443    1443    EXCLUDE anml:fishes 100 Esox lucius
lcl|scaffold23064_cov139    1   1244    1244    EXCLUDE anml:primates   100 Homo sapiens
lcl|scaffold23200_cov161    1   2845    2845    EXCLUDE anml:fishes 36  Petromyzon marinus
lcl|scaffold23907_cov84 1   1457    1457    EXCLUDE anml:fishes 30  Larimichthys crocea
lcl|scaffold30334_cov56 1   1194    1194    EXCLUDE anml:primates   100 Homo sapiens
lcl|scaffold30481_cov68 1   1036    1036    EXCLUDE anml:fishes 39  Alosa sapidissima
lcl|scaffold31811_cov133    1   1089    1089    EXCLUDE anml:fishes 67  Esox lucius
lcl|scaffold32743_cov113    1   1714    1714    EXCLUDE anml:fishes 47  Danio rerio
lcl|scaffold36904_cov71 1   1333    1333    EXCLUDE anml:fishes 100 Esox lucius
lcl|scaffold38887_cov134    1   1179    1179    EXCLUDE anml:fishes 99  Esox lucius
lcl|scaffold42681_cov75 1   1112    1112    EXCLUDE anml:primates   100 Homo sapiens
lcl|scaffold43151_cov87 1   1725    1725    REVIEW  anml:fishes 18  Anguilla anguilla
lcl|scaffold46639_cov105    1   1595    1595    EXCLUDE anml:fishes 31  Anguilla anguilla
lcl|scaffold52186_cov89 1   1453    1453    EXCLUDE anml:fishes 25  Paramormyrops kingsleyae
lcl|scaffold57901_cov220    1   1554    1554    EXCLUDE anml:fishes 55  Danio rerio
lcl|scaffold93381_cov103    1   1435    1435    EXCLUDE anml:fishes 58  Esox lucius
lcl|scaffold93402_cov123    1   1711    1711    EXCLUDE anml:fishes 20  Esox lucius
lcl|scaffold93509_cov105    1   1642    1642    EXCLUDE anml:fishes 31  Esox lucius
lcl|scaffold93512_cov132    1   4341    4341    EXCLUDE anml:fishes 26  Danio rerio

There are several contigs flagged as contaminations that are very long, yet the percentage alignment coverage of the contamination is 0. The suggestion of FCS-GX is exclude the contig. Obviously I do not want to exclude the whole contig if the contamination only spans a very small proportion. My question now is: What would be your recommendation on how to handle these cases? Is there a way to identify the exact position of the contamination so I can trim or mask it?

Many thanks already in advance for your help!

kind regards,

Philipp

etvedte commented 1 year ago

Hi Philipp,

You can examine these results more closely to determine the appropriate action.

1) When you run GX, there are a number of messages printed to the screen as logs including an aggregate coverage value Aggregate coverage: just above the contamination summary. Can you post your results for that here? We more often see these ultra low-coverage contaminant cases in genomes with lower aggregate coverage.

2) You can grep for the suspect contigs in the supplementary output file *taxonomy.rpt . You can find the specs for that file here: https://github.com/ncbi/fcs/wiki/FCS-GX-taxonomy-report The low-coverage cases are most likely the consequence of short chimeric contaminant spans embedded within sequence that is ambiguous to GX (inconclusive/repeat/low-coverage). This idea is supported if your aggregate coverage from 1) is low. In *taxonomy.rpt you should be able to find the sequence ranges labeled contaminant and can do BLAST searches for validation. If you want to post a couple of example rows from the *taxonomy.rpt on this thread I can try to weigh in.

3) I would also inspect the *taxonomy.rpt file and do BLAST searches for sequences assigned EXCLUDE anml:fishes with modest fish coverage. We have seen fish genomes generating noise in the results, particularly in repeat-laden regions. The assigned repeats info is in column 2 of *taxonomy.rpt.

Just based on what you shared, my hunch is that the sequences marked zero coverage should be retained, not trimmed or excluded. It's possible that after steps 1-3 you conclude that the primate calls are the only high-confidence contaminants in your sample.

reslp commented 1 year ago

Hi,

many thanks for your fast reply and your explanations! I am looking forward to hear you opinions on this:

ad 1) I looked into the log file. The coverage is 7% and the program operates in low-coverage mode. This is directly from the log file:

 * * * Asserted tax-div, anml:worms, conflicts with highest-coverage one, anml:insects. * * *

primary-divs: ['anml:worms'] (25%)
Top represented divs:
    anml:insects     223043 bp
    anml:worms       146862 bp
    anml:fishes       28818 bp

 * * * Warning: Fraction of primary-div = 25%, below 66% * * *

Aggregate coverage: 7%
Low-coverage mode. The following contaminant divs are out of scope
anml:insects : len:5304; div-cvg:49%

--------------------------------------------------------------------

fcs_gx_report.txt contamination summary:
----------------------------------------
                                seqs      bases
                               ----- ----------
TOTAL                             36     832413
-----                          ----- ----------
prok:bacteria                      1     310203
fung:budding yeasts                1     200499
fung:fungi                         1     143096
prok:CFB group bacteria            1     118771
anml:fishes                       23      44938
anml:primates                      9      14906

--------------------------------------------------------------------

ad 2)

Thank you for pointing me to the explanation of the *.rpt file. I have missed this completely. As an example I picked one of the contigs that was flagged EXCLUDE (lcl|scaffold593_cov139). From what I can see the whole contig is flagged initially, although only for the second part the final decision is contamination. Here are the corresponding lines from the report file:

cat Gblic_cleaned.fcs.adaptor.147586.taxonomy.rpt | grep "lcl|scaffold593_cov139"
lcl|scaffold593_cov139~~1..47195    47195   509,0,0,0   3419    |   Candidatus Methanoperedens nitroreducens    1392998 arch:archaea    143 143 18  |   103695  anml:reptiles   230 142 16  |   6526    anml:molluscs   299 165 15  |   3469    plnt:plants 303 144 14  |   0   inconclusive    none    0
lcl|scaffold593_cov139~~47196..49264    2069    0,0,0,0 1073    |   Torulaspora microellipsoides    1136883 fung:budding yeasts 1045    883 56  |   226126  fung:budding yeasts 1045    888 54  |   329046  fung:chytrids   825 825 53  |   3218    plnt:mosses 530 530 46  |3  contaminant fung:budding yeasts 51
lcl|scaffold593_cov139~~49265..200499   151235  1593,58,0,0 10912   |   Harmonia axyridis   115357  anml:insects    1976    866 49  |   214514anml:rodents  313 289 34  |   1043004 fung:ascomycetes    803 250 34  |   2940366 fung:ascomycetes    803 269 32  |0  low-coverage    anml:insects    1

I am now wondering if these three lines are the reason the whole contig is flagged as EXCLUDE, because the three hits span the whole contig length (pos 1 to 200499) even though for lines 1 and 3 the decisions are inconclusive and low-coverage. Am I misinterpreting things here?

ad 3)

I still need to blast the flagged contaminations to double check. However, since (aggregate) coverage seems to be important and you mentioned modest coverage for fish sequences I am wondering what would be a sensible coverage cut off to decide what I should keep. What is the relationship between the aggregated coverage and the contaminations listed in the report? For example, would it make sense to decide based on aggregated coverage and ignore all contaminations in the report that are below 7%?

The whole rationale behind this is that I want to make sure my assemblies are clean before I perform complicated and time-consuming downstream analyses like functional annotations, phylogenomics, comparisons of gene content etc. and before I submit them to NCBI. Having additional contaminations flagged after I performed many analyses could potentially lead me to having to rerun many steps and would consume (lots of) additional time. Ideally I would like to have my assemblies clean before all of that so that they pass all NCBI check once I submit them. I have many genomes to analyze, this is why I am looking for an automated solution that mirrors the checks that NCBI is doing as closely as possible when genomes get submitted.

Many thanks again for your help!

kind regards,

Philipp

etvedte commented 1 year ago

Hi Philipp,

Thanks for the additional info.

Your low aggregate coverage value (7%) may help to explain your results and observations here. This metric represents how well the input genome overlaps with sequences in the GX database. Low values indicate that it is a novel genome relative to the database and there is low overall overlap between the input genome and the correct taxonomic division, in this case, worms. So we have to be a little more wary of potential noise.

You are spot on with the example you posted. In this case only a 2 kbp span was flagged as a potential contaminant. Let me know what the BLAST results for lcl|scaffold593_cov139~~47196..49264 looks like. It doesn't appear to be repetitive. But since your aggregate coverage is low, any inconclusive or low-coverage rows should probably be interpreted as belonging to the source organism (i.e. not contaminant). So the two potential scenarios I see here is this is a valid small fungal contaminant and should be excised internally, or retain the whole sequence.

For the fish sequences, I would just try doing BLAST searches on a bunch (at least 10) and see what you find. There are a couple with 100% coverage, so see what those look like. Try both blastn and blastx. If the aggregate coverage of your genome is low but the contaminant coverage of the assigned contaminants is high, like the case with primates, we are generally more confident of the removal of these sequences even in novel genomes. But when the contaminant coverage is low, it's more suspect. Setting a threshold value is unlikely to be helpful...you are better off making decisions on a taxonomic division level, e.g. remove all primate calls, ignore all fish calls.

Thank you for pointing out your concern here...this is a major need for NCBI submitters that we hope to address with the FCS tool suite. If you remove all the sequences from your genome reported here, it should pass NCBI checks. But we also want to make sure that you aren't removing unnecessary sequences from the genome, and if you do some manual checking now you can have results ready to respond to NCBI in the event the sequences get flagged.

reslp commented 1 year ago

Thank your for your explaination. I now understand how to interpret the aggregated coverage percentage. It does indeed make sense. There are not many flatworm genomes publicly available yet.

I have extracted the sequence mentioned above ran BLASTX (against nr) and BLASTN (against nt). These are the top 5 results I got:

For BLASTX:

RID: R9D19ZMX013    
Job Title:scaffold593_cov139_47196_49264                
Program: BLASTX 
Database: nr All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects
Query #1: scaffold593_cov139_47196_49264 Query ID: lcl|Query_95855 Length: 2069

Sequences producing significant alignments:
                                                                  Scientific      Common                     Max    Total Query   E    Per.   Acc.                        
Description                                                       Name            Name            Taxid      Score  Score cover Value  Ident  Len        Accession        
unnamed protein product [Sparganum proliferum]                    Sparganum pr... NA              64606      478    478   43%   3e-160 74.50  398        VZH98552.1       
unnamed protein product [Spirometra erinaceieuropaei]             Spirometra e... NA              99802      477    477   43%   5e-160 74.50  396        VZI14686.1       
unnamed protein product [Schistocephalus solidus]                 Schistocepha... NA              70667      471    471   43%   2e-158 73.83  336        VDL95269.1       
unnamed protein product [Taenia asiatica]                         Taenia asiatica NA              60517      471    471   43%   1e-157 72.82  391        VDK24578.1       
ribosomal protein l4 [Echinococcus multilocularis]                Echinococcus... NA              6211       472    472   43%   4e-157 72.48  454        CDS40993.1

For BLASTN:

RID: R9CZMS8Y013    
Job Title:scaffold593_cov139_47196_49264                
Program: BLASTN 
Database: nt Nucleotide collection (nt)
Query #1: scaffold593_cov139_47196_49264 Query ID: lcl|Query_6211 Length: 2069

Sequences producing significant alignments:
                                                                  Scientific      Common                     Max    Total Query   E   Per.   Acc.                        
Description                                                       Name            Name            Taxid      Score  Score cover Value Ident  Len        Accession        
PREDICTED: Stegodyphus dumicola 60S ribosomal protein L4-A-lik... Stegodyphus ... NA              202533     174    174   28%   5e-38 72.43  1406       XM_035374257.1   
Fusarium proliferatum ET1 probable ribosomal protein RPL4A...     Fusarium pro... NA              1227346    171    171   10%   6e-37 81.22  1059       XM_031229915.1   
Fusarium mangiferae putative ribosomal protein RPL4A...           Fusarium man... NA              192010     171    171   10%   6e-37 81.22  1059       XM_041829102.1   
Fusarium fujikuroi IMI 58289 probable ribosomal protein RPL4A...  Fusarium fuj... NA              1279085    165    165   10%   3e-35 80.75  1059       XM_023574414.1   
Morchella importuna uncharacterized protein (LAJ45_06824),...     Morchella im... NA              1174673    165    165   10%   3e-35 80.66  1128       XM_046116385.1

While the blastn results do indicate fungal hits, e-value, percent identity and query coverage are low. For blastx instead all five best hits are from proteins of flatworms. This makes more sense. Given the information I have now, I would not exclude the sequence.

I still have to look into the flagged fish sequences. I do think that these are likely real contaminations, as the genomes is from a parasite living on fishes. The assembly has already be prefiltered using a blobology approach, which also incorporates blast assigned sequence identity information.

Thank you also for your thoughts on a cut-off threshold. I will keep thinking about possible solutions when processing many genomes. Of course there is also the possibility to run blast on all flagged contaminations to have an additional source of information.

etvedte commented 1 year ago

blastn results do indicate fungal hits, e-value, percent identity and query coverage are low

would be good when you are reviewing to look at the start/end positions of hits to see if these blastn/blastx results are overlapping or non-overlapping. The query coverage is low but the fungal percent ID is rather high.

I do think that these are likely real contaminations, as the genomes is from a parasite living on fishes

This is also good information. Biologically these contaminants make sense.

etvedte commented 1 year ago

Closing. Please follow-up if you have other questions.