ncbi / fcs

Foreign Contamination Screening caller scripts and documentation
Other
88 stars 12 forks source link

[FEATURE REQUEST]: Single action to trim contaminant ends and mask internal contamination #88

Closed dmacguigan closed 3 days ago

dmacguigan commented 3 days ago

This is a feature request for FCS-adaptor (and FCS-GX).

I understand how ACTION_TRIM and FIX can be set manually in the fcs_adaptor_report.txt to trim contaminant ends or split/mask internal contamination. However, a single contig sometimes contains both terminal and internal contamination. In that case, I do not see an option that would trim the ends and hard mask internally. Using ACTION_TRIM will split the contig, and if FIX finds terminal contamination it fails with the error:

Fatal error: fasta.cpp:705 in ApplyActionReport(...): Assertion failed: action != "FIX" || ((ivl.pos != 1) && (stop_pos != seq_len))

I would ideally like a single option to trim contaminant ends and mask internal contamination. This will make it much easier to automate contaminant screening and alleviate the need to manually tweak the adaptor file.

Is there a way to do this currently?

etvedte commented 3 days ago

Hello,

Can you send sample report rows that are generating this error?

Eric

dmacguigan commented 3 days ago

This is a test row that I created to see what happens when FIX is used for a contig containing both terminal and internal contamination.

ptg000023l_1    34307936    FIX 1..30,2057178..2057205,17873952..17873983   CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB02000.1:Oxford Nanopore Technologies Rapid Adapter (RA) Ligation Adapter top (LA) Native Adaptor top (NA) polyT masked
dmacguigan commented 3 days ago

The full accompany error:

While loading row 5 of the action report:
ptg000023l_1    34307936    FIX 1..30,2057178..2057205,17873952..17873983   CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB02000.1:Oxford Nanopore Technologies Rapid Adapter (RA) Ligation Adapter top (LA) Native Adaptor top (NA) polyT masked 
Fatal error: fasta.cpp:705 in ApplyActionReport(...): Assertion failed: action != "FIX" || ((ivl.pos != 1) && (stop_pos != seq_len))
Traceback (most recent call last):
  File "/projects/academic/tkrabben/software/NCBI_FCS/fcs.py", line 476, in <module>
    sys.exit(main())
  File "/projects/academic/tkrabben/software/NCBI_FCS/fcs.py", line 465, in main
    gx.run()
  File "/projects/academic/tkrabben/software/NCBI_FCS/fcs.py", line 347, in run
    self.args.func(self)
  File "/projects/academic/tkrabben/software/NCBI_FCS/fcs.py", line 343, in run_clean_mode
    self.run_gx()
  File "/projects/academic/tkrabben/software/NCBI_FCS/fcs.py", line 243, in run_gx
    self.safe_exec(docker_args)
  File "/projects/academic/tkrabben/software/NCBI_FCS/fcs.py", line 168, in safe_exec
    subprocess.run(args, shell=False, check=True, text=True, stdout=sys.stdout, stderr=sys.stderr)
  File "/cvmfs/soft.ccr.buffalo.edu/versions/2023.01/compat/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['singularity', 'exec', '--bind', '/vscratch/grp-tkrabben/MacGuigan/genome_assemblies/Sander_vitreus_TJK-77/HERRO/NCBI_FCS/hifiasm_defaults/outputdir_adaptor:/report-volume/', '--bind', '/vscratch/grp-tkrabben/MacGuigan/genome_assemblies/Sander_vitreus_TJK-77/HERRO/NCBI_FCS/hifiasm_defaults:/output-volume/', '--bind', '/vscratch/grp-tkrabben/MacGuigan/genome_assemblies/Sander_vitreus_TJK-77/HERRO/NCBI_FCS/hifiasm_defaults:/contam-out-volume/', '/projects/academic/tkrabben/software/NCBI_FCS/fcs-gx.sif', '/app/bin/gx', 'clean-genome', '--action-report', '/report-volume/fcs_adaptor_report_mod.txt', '--output', '/output-volume/Svit_TJK-77.purged.adaptorClean.fasta', '--contam-fasta-out', '/contam-out-volume/Svit_TJK-77.purged.adaptorContam.fasta']' returned non-zero exit status 1.
etvedte commented 3 days ago

OK I think I see what's happening. Action FIX cannot be applied at terminal positions.

Can you try this?

ptg000023l_1    34307936    ACTION_TRIM 1..30   CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB02000.1:Oxford Nanopore Technologies Rapid Adapter (RA) Ligation Adapter top (LA) Native Adaptor top (NA) polyT masked
ptg000023l_1    34307936    FIX 2057178..2057205,17873952..17873983 CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB02000.1:Oxford Nanopore Technologies Rapid Adapter (RA) Ligation Adapter top (LA) Native Adaptor top (NA) polyT masked

or if that doesn't work:

ptg000023l_1    34307936    TRIM    1..30   CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB02000.1:Oxford Nanopore Technologies Rapid Adapter (RA) Ligation Adapter top (LA) Native Adaptor top (NA) polyT masked
ptg000023l_1    34307936    FIX 2057178..2057205,17873952..17873983 CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB02000.1:Oxford Nanopore Technologies Rapid Adapter (RA) Ligation Adapter top (LA) Native Adaptor top (NA) polyT masked

Please also verify that the appropriate query sequence is getting masked. It should align to:

>NGB02000.1 Oxford Nanopore Technologies Rapid Adapter (RA) Ligation Adapter top (LA) Native Adaptor top (NA) polyT masked
NNTTTTTTCCTGTACTTCGTTCAGTTACGTATTGCT
dmacguigan commented 3 days ago

The following lines worked.

ptg000023l_1    34307936    ACTION_TRIM 1..30   CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB02000.1:Oxford Nanopore Technologies Rapid Adapter (RA) Ligation Adapter top (LA) Native Adaptor top (NA) polyT masked
ptg000023l_1    34307936    FIX 2057178..2057205,17873952..17873983 CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB02000.1:Oxford Nanopore Technologies Rapid Adapter (RA) Ligation Adapter top (LA) Native Adaptor top (NA) polyT masked

screen log:

Applied 28 actions; 780 bps dropped; 0 bps lowercased; 60 bps hardmasked.

However, what I'm really hoping for is a new option like FIX or ACTION_TRIM that allows me to do this without manually splitting up a contig into multiple lines in the adaptor report file. Is that possible?

etvedte commented 3 days ago

Unfortunately there is no way to handle this using a parameter currently.

What I would recommend for now is making a simple parsing script to handle this. I can't offer an official solution, but I'd bet ChatGPT could come up with something fairly easily. What you want is to:

etvedte commented 3 days ago

Bottom line, I realize this workaround is a bit of a pain.

In the future we plan to design a pipeline where users can run both FCS-adaptor and FCS-GX with a single command. This will be an opportunity for us to discuss making the report formats of the two tools more consistent and potentially adding new parameters to allow more user control of cleaning. But I don't imagine this will be implemented in the near future, so my best suggestion for now is to follow the approach I described above.

dmacguigan commented 3 days ago

Will do. I just wanted to confirm that this wasn't already an option before writing my own parser. I will post the code here once I've tested it.

Glad to hear that there are plans to merge to tools!

etvedte commented 3 days ago

Thanks. Also be sure to validate the masked ranges like I mentioned above. clean genome should be able to properly handle multiple rows per sequence such that an end TRIM doesn't improperly shift the FIXcoordinates.

dmacguigan commented 3 days ago

Here's my little solution.

# make modified version of adaptor report so that internal contamination is masked, not trimmed
# read each line of adaptor report into an array
outputdirPath="/vscratch/grp-tkrabben/MacGuigan/genome_assemblies/Sander_vitreus_TJK-77/HERRO/NCBI_FCS/hifiasm_defaults/outputdir_adaptor/"
while IFS=$'\t' read -r -a LINE
do
  if [[ ${LINE[0]} == \#* ]] # if line is a header, skip
  then
    ( IFS=$'\t'; echo "${LINE[*]}" > ${outputdirPath}/fcs_adaptor_report_mod.txt ) 
  else
    # parse the ranges
    IFS=',' read -r -a ranges_array <<< "${LINE[3]}" # create array of regions to mask
    for r in "${ranges_array[@]}" # for each range
    do
      if [[ ${r} == 1\.\.* || ${r} == *\.\.${LINE[1]} ]] 
      then # if the range is at the start or end of the sequence, set mode to trim (ACTION_TRIM)
        echo -e "${LINE[0]}\t${LINE[1]}\tACTION_TRIM\t${r}\t${LINE[4]}" >> ${outputdirPath}/fcs_adaptor_report_mod.txt
      else # if the range is in the middle of a sequence, set mode to mask (FIX)
        echo -e "${LINE[0]}\t${LINE[1]}\tFIX\t${r}\t${LINE[4]}" >> ${outputdirPath}/fcs_adaptor_report_mod.txt
      fi
    done
  fi
done <${outputdirPath}/fcs_adaptor_report.txt
# use fcs_adaptor_report_mod.txt for trimming and masking with fcs.py

It seems to work. I've checked a few internal masked regions by eye, and they seem to be in the correct spot. For example:

image