ncbi / fcs

Foreign Contamination Screening caller scripts and documentation

fcs-gx: contamination in overview and taxonomy, but 0bp suggestions #14

Closed JudithR closed 1 year ago

JudithR commented 1 year ago

Hi, I used fcs-gx for a bird genome assembly using singularity with version 0.2.3 following the instructions in the wiki. For the test I get similar results to the ones given, the major difference being that fcs-gx now also gives REVIEW suggestions (fcsgx_test.fa.6973.fcs_gx_report.txt).

I ran it with $SHM_LOC as the file location (in the absence of sudo permissions).

python3 ./run_fcsgx.py --fasta ./cleaned_sequences/contig_corrected.fa --out-dir ./gx_out/ --gx-db "${SHM_LOC}/gxdb/all" --gx-db-disk ./gxdb --split-fasta --tax-id $TAXID --container-engine=singularity --image=fcs-gx.0.2.3.sif
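
A long invocation like this is easy to mistype, so one option is to assemble it as an argument list, which keeps every flag and value a separate token. The following is only a sketch: it prints the command rather than running it, uses only the flags quoted above, and the `<TAXID>` and `<SHM_LOC>` fallbacks are placeholders, since the actual values are not given in this thread.

```python
import os
import shlex

# Placeholders (assumptions): the real tax ID and shared-memory path are not stated here.
taxid = os.environ.get("TAXID", "<TAXID>")
shm_loc = os.environ.get("SHM_LOC", "<SHM_LOC>")

# Each flag and each value is its own list element, so two arguments
# cannot run together the way they can in a hand-typed shell line.
cmd = [
    "python3", "./run_fcsgx.py",
    "--fasta", "./cleaned_sequences/contig_corrected.fa",
    "--out-dir", "./gx_out/",
    "--gx-db", f"{shm_loc}/gxdb/all",
    "--gx-db-disk", "./gxdb",
    "--split-fasta",
    "--tax-id", taxid,
    "--container-engine=singularity",
    "--image=fcs-gx.0.2.3.sif",
]

# Print a shell-quoted version for inspection; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
```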

When I run it with the "all" database on my data, I get the following log:

Processed 742 queries, 1273.9Mbp in 228.895s. (5.56543Mbp/s)
Source file /output-volume//contig_corrected.$TAXID.taxonomy.rpt.tmp

primary-divs: ['anml:birds'] (99%)
Top represented divs:
    anml:birds   1088873214 bp
    prst:algae        26404 bp
    anml:fishes       23058 bp
    anml:insects      22718 bp

Aggregate coverage: 86%

--------------------------------------------------------------------

fcs_gx_report.txt contamination summary:
----------------------------------------
                                seqs      bases
                               ----- ----------
TOTAL                              0          0
-----                          ----- ----------

--------------------------------------------------------------------

fcs_gx_report.txt action summary:
---------------------------------
                                seqs      bases
                               ----- ----------
TOTAL                              0          0
-----                          ----- ----------

--------------------------------------------------------------------

The top represented divs suggest the presence of taxa other than birds in the assembly. In the taxonomy.rpt I do get 4 lines labeled as contaminant:

lcl|scaffold_110~668195..696803 28609   1354,22273,60,0 6607    |       Amblyraja radiata       386614  anml:fishes     2544    850     56      |       7757    anml:fishes     2544    743     41      |       9860    anml:mammals    937     466     40      |       7160    anml:insects    1403    633     36      |       4       contaminant     anml:fishes     9
lcl|scaffold_110~697004..800915~~509..1873      1365    0,0,0,0 1164    |       Carcharodon carcharias  13397   anml:fishes     960     922     39      |       307658  anml:insects    761     679     34      |       33528   anml:fishes     960     266     19      |       6526    anml:molluscs   167     148     19      |       3       contaminant     anml:fishes     70
lcl|scaffold_201~1..120194~~65337..69151        3815    222,1217,0,0    3064    |       Petromyzon marinus      7757    anml:fishes     2767    2767    77      |       6573    anml:molluscs   1312    1312    59      |       2762    prst:protists   130     97      12      |       2778097 fung:basidiomycetes     132     72      11      |       3       contaminant     anml:fishes     73
lcl|scaffold_73~761840..798847  37008   1691,26206,180,0        8490    |       Chanos chanos   29144   anml:fishes     2828    781     44      |       390379  anml:fishes     2828    993     41      |       210409  anml:crustaceans        642     312     30      |       931557  anml:insects    2594    599     30      |       3       contaminant     anml:fishes     8

However, the summary of the report in the log is empty, as is the report itself:

##[["FCS genome report", 2, 1], {"git-rev": "0.2.3-14-g20bf86e2", "run-date": "Thu Oct 13 14:58:37 2022", "db": {"build-date": "2022-07-09", "seqs": 3086610, "Gbp": 708.351}, "run-info": {"agg-cvg": 0.859, "asserted-div": "anml:birds", "primary-divs": ["anml:birds"]}}]
#seq_id start_pos       end_pos seq_len action  div     agg_cont_cov    top_tax_name

Is this output correct given the logs and the taxonomy report? If not, any suggestions to where it might have gone wrong?

Thx a lot Judith

etvedte commented 1 year ago

Hi Judith,

Thank you for your interest in GX!

The output looks correct, and GX is asserting your genome is clean. The taxonomy matches in Top represented divs: are those that had the highest aggregate alignment coverage, but they aren't necessarily included in the final contamination report because various filtering criteria are used to ignore lower-confidence contaminant assignments. We will modify our reporting to reduce confusion, thanks for pointing this out.
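
To confirm that a report like the one quoted above really contains no calls, it can be read programmatically. This is a minimal sketch that assumes only what the quoted text shows: one `##` line holding JSON metadata, one `#` column-header line, and then zero or more data rows. The function name `summarize_report` is made up for illustration, and the text is pasted from this thread rather than read from a file.

```python
import json

# Report text as quoted in this thread (column separators are whitespace here).
report_text = '''##[["FCS genome report", 2, 1], {"git-rev": "0.2.3-14-g20bf86e2", "run-date": "Thu Oct 13 14:58:37 2022", "db": {"build-date": "2022-07-09", "seqs": 3086610, "Gbp": 708.351}, "run-info": {"agg-cvg": 0.859, "asserted-div": "anml:birds", "primary-divs": ["anml:birds"]}}]
#seq_id start_pos       end_pos seq_len action  div     agg_cont_cov    top_tax_name
'''

def summarize_report(text):
    """Return (metadata, data_rows) for report text in the layout assumed above."""
    meta = None
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue                      # skip blank lines
        if line.startswith("##"):
            meta = json.loads(line[2:])   # JSON metadata after the '##' prefix
        elif line.startswith("#"):
            continue                      # column-header line
        else:
            rows.append(line.split())     # one reported sequence per row
    return meta, rows

meta, rows = summarize_report(report_text)
run_info = meta[1]["run-info"]
print("primary divs:", run_info["primary-divs"])
print("aggregate coverage:", run_info["agg-cvg"])
print("reported sequences:", len(rows))
```

With no data rows after the header, the count of reported sequences is zero, which matches the empty contamination and action summaries in the log.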

The taxonomy report shows four fish assignments that are short spans (lengths are in the columns immediately after anml:fishes) and are repeat-rich (column 3). It would be helpful if you could copy the full set of rows for these scaffolds, not just the contaminant ones. With the information you provided, I'm guessing GX throws out these calls because they are short, intra-kingdom spans. Animal-in-animal contamination is only reported if it is of sufficient length and coverage.
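
To put numbers on how short these spans are, the following sketch uses values copied from the four contaminant rows quoted earlier: the span length (second column) and the bp assigned to anml:fishes (the value right after the first anml:fishes). Reading the final column as a rounded percent coverage is an inference from these four rows, not a documented definition, and the sketch does not reproduce GX's actual filtering thresholds, which are not stated in this thread.

```python
# (span id, span length in bp, bp assigned to anml:fishes, final column of the row)
rows = [
    ("scaffold_110~668195..696803", 28609, 2544, 9),
    ("scaffold_110~697004..800915~~509..1873", 1365, 960, 70),
    ("scaffold_201~1..120194~~65337..69151", 3815, 2767, 73),
    ("scaffold_73~761840..798847", 37008, 2828, 8),
]

def fish_coverage_percent(span_len, fish_bp):
    """Percent of the span covered by fish-assigned alignments."""
    return 100.0 * fish_bp / span_len

for span_id, span_len, fish_bp, reported in rows:
    pct = fish_coverage_percent(span_len, fish_bp)
    print(f"{span_id}: {fish_bp} of {span_len} bp = {pct:.1f}% (row reports {reported})")

# Total fish-assigned bases across the four flagged spans.
total_fish_bp = sum(fish_bp for _, _, fish_bp, _ in rows)
print("total fish-assigned bp in flagged spans:", total_fish_bp)
```

Each flagged span carries under 3 kbp of fish-assigned sequence, and the four together come to roughly 9 kbp in an assembly of about 1273.9 Mbp.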

JudithR commented 1 year ago

Thx for the quick response and the very reassuring answer. We used blobtools before and found the same off-taxon hits, and also concluded that it wasn't likely real contamination. Please find attached the complete output for the 3 flagged scaffolds: possible_contaminants.txt

etvedte commented 1 year ago

Hi Judith,

Thanks for sending these. Indeed, it looks like what I suspected: the sequences are predominantly called bird, with small, repeat-laden spans called fish. We've done a fair amount of testing to come up with criteria to remove lower-confidence contamination calls such as these from the final report. You should be good to proceed with your genome as is.