ncbi / fcs

Foreign Contamination Screening caller scripts and documentation

Using FCS-GX to split/mask internal adaptor sequences #67

Closed bndaniel closed 3 months ago

bndaniel commented 7 months ago

Hello,

I have been working on submitting a genome to NCBI, and along with many others have internal contamination by adaptors (as identified by FCS-adaptor). As recommended, I am trying to use the output from FCS-adaptor as an input to FCS-GX to split or mask the internal adaptors. I made a new report.txt file with action "FIX" or "SPLIT" and get "Applied 0 actions; 0 bps dropped; 0 bps hardmasked" back with no modification to the genome. I noticed someone is having a similar issue in issue #66.

I have attached the modified report.txt file. Is there any issue with filling columns with "NA"? Is there information I am missing? Thanks for the help. Looking forward to seeing a SPLIT or internal trim function in FCS-adaptor soon!

adaptor_contamination_fcsgx.txt

etvedte commented 7 months ago

Hello,

Can you post the exact commands and input you used for FCS-adaptor screening and subsequent cleaning? By "output from FCS-adaptor as an input to FCS-GX" I am assuming you mean here the adaptor report used on the uncleaned FASTA.

Regarding "is there any issue with filling columns with 'NA'": I should test this sometime. I don't think it should be an issue, though.

There is quite a bit of adaptor contamination here. What adaptor sequence is it hitting? Are these at contig boundaries? You might want to look at the reads mapped to the sequence in a viewer to see whether these sequences are making false joins in your assembly. If they are, you would want to split instead of hardmask.

Eric

bndaniel commented 7 months ago

Hi Eric,

The cleaning and report were done when I submitted the genome to NCBI, which gave me a report (the modified version is the one I attached above) and a cleaned genome (with contamination removed at contig boundaries). As you can see in the report.txt file, all adaptor contamination is found at internal positions within contigs, so my main objective is to split the contigs at these contamination sites and then re-run FCS-adaptor to remove these sequences at contig boundaries.

These adaptor sequences were mostly from a cDNA synthesis kit; you can see the original report attached.

Here is the command I am using with the genome provided by NCBI cleaning (decontam_genome.fsa) and the modified adaptor contamination .txt file:

    python3 ./fcs.py clean genome -i ./decontam_genome.fsa --action-report ./adaptor_contamination_fcsgx.txt --output ./clean_genome.fasta --contam-fasta-out ./contam.fasta

RemainingContamination.txt

etvedte commented 7 months ago

Can you forward the email from the NCBI submissions team to eric.tvedte@nih.gov?

bndaniel commented 5 months ago

This remains unresolved. I re-ran FCS-adaptor on my original genome and tried to use its output to TRIM internal adaptors, but the output from FCS-adaptor is not in the format that fcs.py expects. The wiki indicates that you can use fcs_adaptor_report.txt with fcs.py, but I keep getting "Fatal error (St13runtime_error): util.cpp:212 in ConsumeMetalineHeader(...): Expected the first line of the input file to begin with header:

[["FCS genome report",2,

found:

accession length action range name"

etvedte commented 5 months ago

Hi Ben,

Sorry this isn't working as expected. Catching up with the previous communications...

If you are running FCS-adaptor on your own, the resulting fcs_adaptor_report.txt should have one row per sequence (not like the single action per row in the GX-style output). So scaffold_1 should look similar to this:

scaffold_1      3322941 ACTION_TRIM     139816..139841,140342..140366,338804..338829,339330..339353,433763..433788,546394..546418,889324..889348,953469..953496,953597..953623,1125514..1125536,1125637..1125661,1234579..1234604,1235105..1235129,1274509..1274533,1408151..1408175,1408676..1408701,1606116..1606140,1776689..1776714,1777215..1777241,1822912..1822936,1823437..1823460,1839569..1839595,2050714..2050739,2050840..2050864,2081692..2081717,2457137..2457161,2721378..2721403,2721504..2721529,2745521..2745546,2746147..2746172,2886862..2886886,2905448..2905474,3158039..3158057,3158558..3158583      CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00596.1:Evrogen Mint CDS-Gsu adapter polyT masked, contains PacBio ULI adapter subsequence; CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00577.1:CLONTECH 3'-RACE CDS Primer A polyT masked, contains PacBio ULI adapter subsequence; CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB02000.1:Oxford Nanopore Technologies Rapid Adapter (RA) Ligation Adapter top (LA) Native Adaptor top (NA) polyT masked
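The row layout described above (one row per sequence, with all ranges comma-separated in the fourth column) can be read with a few lines of Python. This is only a sketch, not code from fcs.py: the `AdaptorRow` type and `parse_adaptor_row` helper are hypothetical names, the columns are assumed to be whitespace-separated with the free-text name last, and the example row is a shortened version of the scaffold_1 row.

```python
from typing import List, NamedTuple, Tuple


class AdaptorRow(NamedTuple):
    accession: str
    length: int
    action: str
    ranges: List[Tuple[int, int]]  # 1-based, inclusive (start, end)
    name: str


def parse_adaptor_row(line: str) -> AdaptorRow:
    """Parse one data row of an FCS-adaptor style report.

    Assumption: five whitespace-separated columns, with the free-text
    name column last (it may itself contain spaces).
    """
    accession, length, action, range_field, name = line.strip().split(None, 4)
    ranges = []
    for item in range_field.split(","):
        start, end = item.split("..")
        ranges.append((int(start), int(end)))
    return AdaptorRow(accession, int(length), action, ranges, name)


# Shortened scaffold_1 row (first three ranges and first source name only).
row = parse_adaptor_row(
    "scaffold_1\t3322941\tACTION_TRIM\t"
    "139816..139841,140342..140366,338804..338829\t"
    "CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00596.1:Evrogen Mint CDS-Gsu adapter"
)
print(row.accession, row.length, row.action, row.ranges)
```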

When you use fcs.py clean genome with FCS-adaptor style reports, make sure you are doing the following:

  1. Make sure you are using the latest version of fcs.py
  2. Use the original FASTA that you ran with run_fcsadaptor.sh, not the sequences in cleaned_sequences. See https://github.com/ncbi/fcs/wiki/FCS-adaptor-quickstart#clean-the-genome for more details.
  3. Don't change anything in the FCS-adaptor report if you want to split on internal contaminants. You do need to change ACTION_TRIM to FIX if you want to mask.
  4. Run
    cat input.fa | python3 ./fcs.py clean genome --action-report ./outputdir/fcs_adaptor_report.txt --output clean.fasta --contam-fasta-out contam.fasta
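The checklist above can be sketched in Python for anyone scripting the clean step. Both helpers are hypothetical and not part of fcs.py: `set_action` rewrites the action column only when masking is wanted (step 3) and assumes tab-separated columns, while `build_clean_argv` assembles the arguments shown in step 4 and expects the FASTA on stdin.

```python
from typing import List


def set_action(report_text: str, new_action: str,
               old_action: str = "ACTION_TRIM") -> str:
    """Return report text with the action column (3rd field) replaced.

    Assumption: columns are tab-separated. Header lines and rows whose
    action differs from `old_action` pass through unchanged.
    """
    out = []
    for line in report_text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 3 and fields[2] == old_action:
            fields[2] = new_action
        out.append("\t".join(fields))
    return "\n".join(out) + "\n"


def build_clean_argv(report_path: str, output: str, contam_out: str) -> List[str]:
    """Argument list for the clean step; the FASTA is piped in on stdin."""
    return [
        "python3", "./fcs.py", "clean", "genome",
        "--action-report", report_path,
        "--output", output,
        "--contam-fasta-out", contam_out,
    ]


report = (
    "scaffold_10002\t49608\tACTION_TRIM\t1..25\t"
    "CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00596.1:Evrogen Mint CDS-Gsu adapter "
    "polyT masked, contains PacBio ULI adapter subsequence\n"
)
masked = set_action(report, "FIX")  # only needed when masking instead of splitting
argv = build_clean_argv("./outputdir/fcs_adaptor_report.txt",
                        "clean.fasta", "contam.fasta")
print(masked, end="")
print(" ".join(argv))
```

To actually run it, the argument list could be handed to `subprocess.run(argv, stdin=fasta_handle, check=True)` with `fasta_handle` opened on the original FASTA, which mirrors the `cat input.fa |` pipe.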

When the clean run is completed, you should be able to see coordinates in the FASTA header corresponding to the split locations. So the first few splits produce the following...

>scaffold_1~433789..546393
>scaffold_1~339354..433762
>scaffold_1~338830..339329
>scaffold_1~140367..338803
>scaffold_1~139842..140341
>scaffold_1~1..139815  
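The coordinates in these headers are the stretches left over once the reported ranges are removed. A minimal sketch of that arithmetic follows; it is not the fcs.py implementation, `kept_intervals` and `split_headers` are hypothetical names, and any extra rules the real tool applies (for example to very short pieces) are not modeled. Only the first six scaffold_1 ranges are used, so only the first six pieces are meaningful, and they come out in ascending order rather than the descending order shown above.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive


def kept_intervals(length: int, trims: List[Interval]) -> List[Interval]:
    """Return the stretches of a sequence left after removing `trims`."""
    kept = []
    cursor = 1  # first base not yet assigned to a kept or trimmed piece
    for start, end in sorted(trims):
        if start > cursor:
            kept.append((cursor, start - 1))
        cursor = max(cursor, end + 1)
    if cursor <= length:
        kept.append((cursor, length))
    return kept


def split_headers(name: str, length: int, trims: List[Interval]) -> List[str]:
    """Format kept pieces as name~start..end FASTA headers."""
    return [f">{name}~{s}..{e}" for s, e in kept_intervals(length, trims)]


first_trims = [
    (139816, 139841), (140342, 140366), (338804, 338829),
    (339330, 339353), (433763, 433788), (546394, 546418),
]
headers = split_headers("scaffold_1", 3322941, first_trims)
for header in headers[:6]:
    print(header)
```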

If you are still having this error after doing all of the above, let me know.

bndaniel commented 5 months ago

Hi Eric,

Thanks for the reply. I have attempted to do this twice now with fresh downloads of fcs.py and run_fcsadaptor.sh and am still getting the error: "Fatal error (St13runtime_error): util.cpp:212 in ConsumeMetalineHeader(...): Expected the first line of the input file to begin with header:

[["FCS genome report",2,

found:

accession length action range name"

Let me know what you think is the best next step!

Ben


bndaniel commented 5 months ago

The output from run_fcsadaptor.sh looks like:

accession length action range name
scaffold_10000 49619 ACTION_TRIM 49595..49619 CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00596.1:Evrogen Mint CDS-Gsu adapter polyT masked, contains PacBio ULI adapter subsequence
scaffold_10001 49611 ACTION_TRIM 49587..49611 CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00596.1:Evrogen Mint CDS-Gsu adapter polyT masked, contains PacBio ULI adapter subsequence
scaffold_10002 49608 ACTION_TRIM 1..25 CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00596.1:Evrogen Mint CDS-Gsu adapter polyT masked, contains PacBio ULI adapter subsequence
scaffold_10003 50100 ACTION_TRIM 17314..17339,17840..17864,50075..50100 CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00577.1:CLONTECH 3'-RACE CDS Primer A polyT masked, contains PacBio ULI adapter subsequence; CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00596.1:Evrogen Mint CDS-Gsu adapter polyT masked, contains PacBio ULI adapter subsequence

This matches the expected output shown here: https://github.com/ncbi/fcs/raw/main/examples/FCS_combo_test.fcs_adaptor_report.txt

Yet fcs.py seems to want input more like the FCS-GX output, for example https://github.com/ncbi/fcs/raw/main/examples/FCS_combo_test.fcs_gx_report.txt

Best, Ben
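The mismatch described above comes down to the first line of the report file. A small check along these lines can show which of the two layouts a file is in before it is passed to fcs.py. It is a sketch only: `detect_report_format` is a hypothetical helper, the GX prefix is taken from the error message, the column names are taken from the adaptor report header, and tolerating an optional leading `#` on that header is an assumption.

```python
GX_HEADER_PREFIX = '[["FCS genome report"'
ADAPTOR_COLUMNS = ["accession", "length", "action", "range", "name"]


def detect_report_format(first_line: str) -> str:
    """Classify an action report by its first line: 'gx', 'adaptor', or 'unknown'."""
    line = first_line.strip()
    if line.startswith(GX_HEADER_PREFIX):
        return "gx"
    # Assumption: the adaptor header may or may not carry a leading '#'.
    if line.lstrip("#").split() == ADAPTOR_COLUMNS:
        return "adaptor"
    return "unknown"


print(detect_report_format('[["FCS genome report",2,'))
print(detect_report_format("accession length action range name"))
```

A result of "adaptor" together with the ConsumeMetalineHeader error would suggest that the fcs.py or image being run only accepts the GX layout, rather than that the report itself is malformed.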


etvedte commented 5 months ago

Are you using Docker or Singularity?

fcs.py should be able to handle both formats. I don't recognize this error message in the current code. Will continue to look.

In the meantime, can you try running fcs.py clean genome on that example from the wiki? You linked the adaptor report above, and the FASTA is retrievable from zenodo. I tested this on Docker just last week and got it to work.

If this works, it is suggesting there is something different/conflicting with your adaptor report. If this doesn't work, this is some kind of software/image issue.

bndaniel commented 5 months ago

Hi Eric,

I had previous Docker images from when I first used FCS that I needed to remove. I got fcs.py to run, but it seems to have converted all internal adaptor contamination into N's rather than splitting (the contig number is the same). I checked the adaptor report and all actions are ACTION_TRIM, not FIX. Let me know if I am missing something. Thanks for all your help on this.

Best, Ben
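One way to confirm the masked-versus-split observation above is to compare record counts and look at the bases in a reported range. The sketch below uses invented toy sequences, not data from this thread, and `read_fasta` and `is_hardmasked` are hypothetical helpers: an unchanged record count with Ns in the range points to hardmasking, while more records with `~start..end` headers points to splitting.

```python
from typing import Dict


def read_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {first word of header: sequence}."""
    records: Dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def is_hardmasked(seq: str, start: int, end: int) -> bool:
    """True if the 1-based inclusive range start..end consists only of N."""
    region = seq[start - 1:end]
    return len(region) > 0 and set(region.upper()) == {"N"}


# Toy example: a 10 bp contig with a reported range of 4..6.
original = read_fasta(">ctg1\nACGTACGTAC\n")
masked = read_fasta(">ctg1\nACGNNNGTAC\n")
split = read_fasta(">ctg1~1..3\nACG\n>ctg1~7..10\nGTAC\n")

print(len(original), len(masked), len(split))
print(is_hardmasked(masked["ctg1"], 4, 6))
```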


etvedte commented 5 months ago

OK, you're using Docker. Can you verify that the version in the Docker image is up to date when running fcs.py commands? It should be v0.5.0.

Also, please try using the example from the wiki. That has cases with internal ACTION_TRIMs called by FCS-adaptor and should default to splitting with fcs.py clean genome. See what happens.

Hannah1746 commented 4 months ago

I believe I am running into the same issue on my end. Did this ever get resolved?

etvedte commented 3 months ago

@Hannah1746 we are not aware of any issues with splitting vs. masking in the current release. Please verify that you are using the v0.5.0 release. It would be helpful if you could provide additional details:

etvedte commented 3 months ago

There is a new FCS v0.5.4 release that can be tested. Make sure you are using the latest release when screening/cleaning genomes. There weren't any changes relevant to the content of this GitHub issue, but we haven't received any additional information that would help us to troubleshoot. If you're still having this problem with v0.5.4, feel free to re-open the issue.