ncbi / fcs-gx

Foreign Contamination Screening - GX source code
Other
10 stars 4 forks source link

Merging of fcs-gx and fcs-adaptor reports #6

Open fgvieira opened 1 month ago

fgvieira commented 1 month ago

I need to run both fcs-gx and fcs-adaptor on a lot of genomes. To speed-up things, I was thinking if I could run them both in parallel (on the original genome fasta):

genome.fas.gz -> fcs_gx
genome.fas.gz -> fcs_adaptor

merge the reports:

cat fcs_gx_report.txt fcs_adaptor_report.txt > fcs_report.txt

and run gx clean-genome on the merged report:

gx clean-genome --input genome.fas.gz --action-report fcs_report.txt | bgzip > genome.clean.fas.gz

I tried and it works fine if there are no overlaps but, if there are overlapping intervals it gives an error:

NB: fasta.cpp:786 in operator()(...): while processing action on CAJQZP010001473.1 @ [1..76092]len:76092
Fatal error: fasta.cpp:802 in ApplyActionReport(...): Assertion failed: action.ivl.endpos() <= (int64_t)fasta_seq.seq.size() + 1
terminate called after throwing an instance of 'std::runtime_error'
  what():  serial_util.hpp:300 in close(...): 

Error: pipe_istream command 'gzip -c -d < 'eukaryota/temp/ncbi/110799/GCA_907164705.1.fas.gz'' exited with code 36096
Aborted (core dumped)

Since reports have different formats. How can I merge them?

##[["FCS genome report", 2, 1], {"git-rev": "v0.5.4-5-g5dfd516", "run-date": "Mon Aug  5 13:47:56 2024", "db": {"build-date": "2023-01-24", "seqs": 3025824, "Gbp": 709.264}, "run-info": {"agg-cvg": 0.280242, "asserted-div": "anml:insects", [...]
#seq_id start_pos       end_pos seq_len action  div     agg_cont_cov    top_tax_name
CAJQZP010001473.1       1       76092   76092   EXCLUDE prok:a-proteobacteria   15      Ponticoccus marisrubri

and

#accession      length  action  range   name
CAJQZP010000764.1       988538  ACTION_TRIM     499974..500000  CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB00875.1:Ion Xpress Barcode 3 A Adapter
CAJQZP010000899.1       221     ACTION_EXCLUDE          CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB02000.1:Oxford Nanopore Technologies Rapid Adapter (RA) Ligation Adapter top (LA) Native Adaptor top (NA) polyT masked
CAJQZP010000902.1       278     ACTION_TRIM     259..278        CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB02000.1:Oxford Nanopore Technologies Rapid Adapter (RA) Ligation Adapter top (LA) Native Adaptor top (NA) polyT masked

thanks,

etvedte commented 1 month ago

The problem is, as you observed, the reports have different formats. gx clean-genome can interpret both reports separately but not simultaneously.

I did some testing here. Similar error messages are present when there is a sequence marked by GX as EXCLUDE and the same sequence is marked up as contaminated in the FCS-adaptor report with ACTION_EXCLUDE or ACTION TRIM.

BUT, I did a dummy example when the merged report has a GX-style row with a TRIM and further down there is an adaptor-style row with a ACTION_TRIM. There is no apparent error message printed. But there is no sanity checking on the length column (column 2) and so what ends up happening is there is over-cleaning.

CAJQZP010001473.1   1   10000   76092   TRIM    prok:a-proteobacteria   15  Ponticoccus marisrubri
CAJQZP010001473.1   76092   ACTION_TRIM 1..278  CONTAMINATION_SOURCE_TYPE_ADAPTOR:NGB02000.1:Oxford Nanopore Technologies Rapid Adapter (RA) Ligation Adapter top (LA) Native Adaptor top (NA) polyT masked

Applied 2 actions; 10278 bps dropped; 0 bps lowercased; 0 bps hardmasked.

My best suggestion for now is to run these tools sequentially. As an example, you could run FCS-adaptor, gx clean-genome on the output, run FCS-GX on the first cleaned FASTA, then gx clean-genome on that output. Eventually we plan to integrate the tools into a single FCS pipeline which will produce a consolidated, non-overlapping report.

fgvieira commented 1 month ago

I need to decontaminated several thousand genomes (~18k euk + 500k bact), and the sequential approach you suggest would require an extra write to disk.

Are there any differences between EXCLUDE/ACTION_EXCLUDE and TRIM/ACTION_TRIM? If not, can I convert all ACTION_EXCLUDE into just EXCLUDE?

I tried merging both, but it is a bit hacky. First, convert fcs_adaptor output into fcs_gx format:

grep -v "#" fcs_adaptor.report.txt | awk -F"\t" '$4==""{$4="1.."$2} {OFS="\t"; print}' | awk -F"\t" '{split($4,coord,"."); split($5,desc,":"); print $1"\t"coord[1]"\t"coord[3]"\t"$2"\t"$3"\t"desc[1]":"desc[2]"\t\t"desc[3]}' > fcs_adaptor.report_fix.txt

then merge overlaps (giving preference to EXCLUDE > ACTION_EXCLUDE > TRIM > ACTION_TRIM > FIX > ACTION_FIX > [REST]):

grep -hv '#' fcs_gx.report.txt fcs_adaptor.report_fix.txt | awk -F"\t" '{{OFS="\t"; $9=0}} $5~/EXCLUDE/{{$9=10}} $5~/TRIM/{{$9=5}} $5~/FIX/{{$9=3}} $5~/ACTION_/{{$9=$9-1}} !/#/{{print}}' | bedtools sort | bedtools merge -c 4,5,6,7,8,9 -o collapse | perl -MList::Util=max -MList::MoreUtils=firstidx -F"\t" -an -e '@r=split(",",$F[8]); if($#r>0){$fi=firstidx {max(@r)} @r}; foreach $i (3,4,5,6,7){@a=split(",",$F[$i]); $F[$i]=$a[$fi]}; pop(@F); print(join("\t",@F)."\n")' | gx clean-genome --input {input.fas} --action-report /dev/stdin > output.fas

Would something like this make sense?

etvedte commented 1 month ago

The commands you posted appear to work, at least for the simpler example I used.

However, there are some discrepancies in the action column that will need to be accounted for depending on your desired results: GX EXCLUDE = adaptor ACTION_EXCLUDE. Straightforward and should be priority for merging overlaps. GX TRIM = adaptor ACTION_TRIM when the contamination is at one of the two terminal ends of a sequence GX FIX : gx clean-genome will hardmask these internal contamination regions adaptor ACTION_TRIM : gx clean-genome will split contigs at these internal contamination regions There is no such thing as adaptor ACTION_FIX SPLIT is synonymous to ACTION_TRIM

So if you want to hardmask at identified internal contaminants, in the FCS-adaptor conversion command you would want to change ACTION_TRIM > FIXat internal contamination positions, then do the merging.

If you want to split at internal contaminants, convert FIX > SPLIT in the GX report, then do the merging.

gx clean-genome will throw an error if you have a TRIM at internal positions, but with the prioritization strategy you use above, I think the rare case of TRIM overlap with FIX would correctly assign it as TRIM be able to handle this.

fgvieira commented 1 month ago

Ok, so if I want to keep the default behaviour (adaptor split on internal contaminants, and gx mask), I can convert the fcs_adaptor_report file as:

grep -v "#" fcs_adaptor_report.txt | awk -F"\t" '$4==""{$4="1.."$2} {OFS="\t"; sub("ACTION_","",$3); print}' | awk -F"\t" '{split($4,coord,"."); split($5,desc,":")} $3=="TRIM" && coord[1]>1 && coord[3]<$2 {$3="SPLIT"} {print $1"\t"coord[1]"\t"coord[3]"\t"$2"\t"$3"\t"desc[1]":"desc[2]"\t\t"desc[3]}' > fcs_adaptor_report.fix.txt

and then merge (now a bit simpler) with fcs_gx_report (giving preference to EXCLUDE > TRIM > SPLIT > FIX > [REST]):

grep -hv '#' GCA_907164705.1.fcs_gx_report.txt fcs_adaptor_report.fix.txt | awk -F"\t" '{{OFS="\t"; $9=1}} $5~/EXCLUDE/{{$9=5}} $5~/TRIM/{{$9=4}} $5~/SPLIT/{{$9=3}} $5~/FIX/{{$9=2}} !/#/{{print}}' | bedtools sort | bedtools merge -c 4,5,6,7,8,9 -o collapse | perl -MList::Util=max -MList::MoreUtils=firstidx -F"\t" -an -e '@r=split(",",$F[8]); if($#r>0){$fi=firstidx {max(@r)} @r}; foreach $i (3,4,5,6,7){@a=split(",",$F[$i]); $F[$i]=$a[$fi]}; pop(@F); print(join("\t",@F)."\n")' | gx clean-genome --input {input.fas} --action-report /dev/stdin > output.fas

Would this reproduce the original fcs behaviour?

etvedte commented 1 month ago

Would this reproduce the original fcs behaviour? Yes, in the sense that it is using the default corrective actions. But the reality is a bit more complicated.

For NCBI submissions, we wouldn't take corrective actions on internal contaminants. We provide the report to the submitter, and it would be up to them to analyze the results, look at secondary data sources (e.g. reads spanning the questionable regions, HiC contact maps, etc.) , determine whether a sequence should be masked or split at that region, then resubmit the corrected genome.

So if you are running many genomes, the best course of action would be to manually evaluate GX FIX / adaptor internal ACTION_TRIM and then decide how to handle with gx clean-genome. These cases should be relatively rare. If internal calls are sporadic, its probably OK to just mask so long as you have external evidence for connectivity in the surrounding regions. Individual genomes with many internal contaminants should be evaluated more closely for assembly issues.