Closed dmacguigan closed 3 days ago
Hello,
Can you send sample report rows that are generating this error?
Eric
This is a test row that I created to see what happens when FIX
is used for a contig containing both terminal and internal contamination.
ptg000023l_1 34307936 FIX 1..30,2057178..2057205,17873952..17873983 CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB02000.1:Oxford Nanopore Technologies Rapid Adapter (RA) Ligation Adapter top (LA) Native Adaptor top (NA) polyT masked
The full accompany error:
While loading row 5 of the action report:
ptg000023l_1 34307936 FIX 1..30,2057178..2057205,17873952..17873983 CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB02000.1:Oxford Nanopore Technologies Rapid Adapter (RA) Ligation Adapter top (LA) Native Adaptor top (NA) polyT masked
Fatal error: fasta.cpp:705 in ApplyActionReport(...): Assertion failed: action != "FIX" || ((ivl.pos != 1) && (stop_pos != seq_len))
Traceback (most recent call last):
File "/projects/academic/tkrabben/software/NCBI_FCS/fcs.py", line 476, in <module>
sys.exit(main())
File "/projects/academic/tkrabben/software/NCBI_FCS/fcs.py", line 465, in main
gx.run()
File "/projects/academic/tkrabben/software/NCBI_FCS/fcs.py", line 347, in run
self.args.func(self)
File "/projects/academic/tkrabben/software/NCBI_FCS/fcs.py", line 343, in run_clean_mode
self.run_gx()
File "/projects/academic/tkrabben/software/NCBI_FCS/fcs.py", line 243, in run_gx
self.safe_exec(docker_args)
File "/projects/academic/tkrabben/software/NCBI_FCS/fcs.py", line 168, in safe_exec
subprocess.run(args, shell=False, check=True, text=True, stdout=sys.stdout, stderr=sys.stderr)
File "/cvmfs/soft.ccr.buffalo.edu/versions/2023.01/compat/usr/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['singularity', 'exec', '--bind', '/vscratch/grp-tkrabben/MacGuigan/genome_assemblies/Sander_vitreus_TJK-77/HERRO/NCBI_FCS/hifiasm_defaults/outputdir_adaptor:/report-volume/', '--bind', '/vscratch/grp-tkrabben/MacGuigan/genome_assemblies/Sander_vitreus_TJK-77/HERRO/NCBI_FCS/hifiasm_defaults:/output-volume/', '--bind', '/vscratch/grp-tkrabben/MacGuigan/genome_assemblies/Sander_vitreus_TJK-77/HERRO/NCBI_FCS/hifiasm_defaults:/contam-out-volume/', '/projects/academic/tkrabben/software/NCBI_FCS/fcs-gx.sif', '/app/bin/gx', 'clean-genome', '--action-report', '/report-volume/fcs_adaptor_report_mod.txt', '--output', '/output-volume/Svit_TJK-77.purged.adaptorClean.fasta', '--contam-fasta-out', '/contam-out-volume/Svit_TJK-77.purged.adaptorContam.fasta']' returned non-zero exit status 1.
OK I think I see what's happening. Action FIX cannot be applied at terminal positions.
Can you try this?
ptg000023l_1 34307936 ACTION_TRIM 1..30 CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB02000.1:Oxford Nanopore Technologies Rapid Adapter (RA) Ligation Adapter top (LA) Native Adaptor top (NA) polyT masked
ptg000023l_1 34307936 FIX 2057178..2057205,17873952..17873983 CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB02000.1:Oxford Nanopore Technologies Rapid Adapter (RA) Ligation Adapter top (LA) Native Adaptor top (NA) polyT masked
or if that doesn't work:
ptg000023l_1 34307936 TRIM 1..30 CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB02000.1:Oxford Nanopore Technologies Rapid Adapter (RA) Ligation Adapter top (LA) Native Adaptor top (NA) polyT masked
ptg000023l_1 34307936 FIX 2057178..2057205,17873952..17873983 CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB02000.1:Oxford Nanopore Technologies Rapid Adapter (RA) Ligation Adapter top (LA) Native Adaptor top (NA) polyT masked
Please also verify that the appropriate query sequence is getting masked. It should align to:
>NGB02000.1 Oxford Nanopore Technologies Rapid Adapter (RA) Ligation Adapter top (LA) Native Adaptor top (NA) polyT masked
NNTTTTTTCCTGTACTTCGTTCAGTTACGTATTGCT
The following lines worked.
ptg000023l_1 34307936 ACTION_TRIM 1..30 CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB02000.1:Oxford Nanopore Technologies Rapid Adapter (RA) Ligation Adapter top (LA) Native Adaptor top (NA) polyT masked
ptg000023l_1 34307936 FIX 2057178..2057205,17873952..17873983 CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB02000.1:Oxford Nanopore Technologies Rapid Adapter (RA) Ligation Adapter top (LA) Native Adaptor top (NA) polyT masked
screen log:
Applied 28 actions; 780 bps dropped; 0 bps lowercased; 60 bps hardmasked.
However, what I'm really hoping for is a new option like FIX
or ACTION_TRIM
that allows me to do this without manually splitting up a contig into multiple lines in the adaptor report file. Is that possible?
Unfortunately there is no way to handle this using a parameter currently.
What I would recommend for now is making a simple parsing script to handle this. I can't offer an official solution, but I'd bet ChatGPT could come up with something fairly easily. What you want is to:
Bottom line, I realize this workaround is a bit of a pain.
In the future we plan to design a pipeline where users can run both FCS-adaptor and FCS-GX with a single command. This will be an opportunity for us to discuss making the report formats of the two tools more consistent and potentially adding new parameters to allow more user control of cleaning. But I don't imagine this will be implemented in the near future, so my best suggestion for now is to follow the approach I described above.
Will do. I just wanted to confirm that this wasn't already an option before writing my own parser. I will post the code here once I've tested it.
Glad to hear that there are plans to merge to tools!
Thanks. Also be sure to validate the masked ranges like I mentioned above. clean genome
should be able to properly handle multiple rows per sequence such that an end TRIM
doesn't improperly shift the FIX
coordinates.
Here's my little solution.
# make modified version of adaptor report so that internal contamination is masked, not trimmed
# read each line of adaptor report into an array
outputdirPath="/vscratch/grp-tkrabben/MacGuigan/genome_assemblies/Sander_vitreus_TJK-77/HERRO/NCBI_FCS/hifiasm_defaults/outputdir_adaptor/"
while IFS=$'\t' read -r -a LINE
do
if [[ ${LINE[0]} == \#* ]] # if line is a header, skip
then
( IFS=$'\t'; echo "${LINE[*]}" > ${outputdirPath}/fcs_adaptor_report_mod.txt )
else
# parse the ranges
IFS=',' read -r -a ranges_array <<< "${LINE[3]}" # create array of regions to mask
for r in "${ranges_array[@]}" # for each range
do
if [[ ${r} == 1\.\.* || ${r} == *\.\.${LINE[1]} ]]
then # if the range is at the start or end of the sequence, set mode to trim (ACTION_TRIM)
echo -e "${LINE[0]}\t${LINE[1]}\tACTION_TRIM\t${r}\t${LINE[4]}" >> ${outputdirPath}/fcs_adaptor_report_mod.txt
else # if the range is in the middle of a sequence, set mode to mask (FIX)
echo -e "${LINE[0]}\t${LINE[1]}\tFIX\t${r}\t${LINE[4]}" >> ${outputdirPath}/fcs_adaptor_report_mod.txt
fi
done
fi
done <${outputdirPath}/fcs_adaptor_report.txt
# use fcs_adaptor_report_mod.txt for trimming and masking with fcs.py
It seems to work. I've checked a few internal masked regions by eye, and they seem to be in the correct spot. For example:
This is a feature request for FCS-adaptor (and FCS-GX).
I understand how
ACTION_TRIM
andFIX
can be set manually in thefcs_adaptor_report.txt
to trim contaminant ends or split/mask internal contamination. However, a single contig sometimes contains both terminal and internal contamination. In that case, I do not see an option that would trim the ends and hard mask internally. UsingACTION_TRIM
will split the contig, and ifFIX
finds terminal contamination it fails with the error:Fatal error: fasta.cpp:705 in ApplyActionReport(...): Assertion failed: action != "FIX" || ((ivl.pos != 1) && (stop_pos != seq_len))
I would ideally like a single option to trim contaminant ends and mask internal contamination. This will make it much easier to automate contaminant screening and alleviate the need to manually tweak the adaptor file.
Is there a way to do this currently?