lisavader / plasmidEC

An ensemble of plasmid classification tools
MIT License
Problem with plasmid_contigs.fasta #10

Closed varunshamanna closed 2 years ago

varunshamanna commented 2 years ago

The contigs written to plasmid_contigs.fasta are not complete.

I used seqtk to extract the FASTA records, and I also wanted the file name in the contig headers, so I added the following to write_plasmid_contigs.sh:

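# Collect the names of the plasmid contigs, one per line
for contig in $plasmid_contigs; do
        echo "$contig" >> "$out_dir/contig.txt"
        #grep -A 1 $contig $input >> $out_dir/plasmid_contigs.fasta
done

# Extract the complete records with seqtk and prefix each header with the input file name
TAG="$(basename "$input" .fasta)"
seqtk subseq "$input" "$out_dir/contig.txt" >> "$out_dir/${TAG}_plasmids.fasta"
sed -i "s/^>/>${TAG}_/" "$out_dir/${TAG}_plasmids.fasta"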

I made this change because the -A 1 option makes grep print only the single line that follows the header:

-A, --after-context=NUM   print NUM lines of trailing context

The contigs from SPAdes are never on a single line.
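For anyone without seqtk, the same multi-line extraction can be sketched in plain awk. This is only an illustration, not the fix that was later committed: the function name extract_contigs is made up here, and it assumes the ID file holds one contig name per line without the leading ">" (as in contig.txt above).

```shell
#!/usr/bin/env bash
# extract_contigs IDS_FILE FASTA  (hypothetical helper, not part of plasmidEC)
# Print every FASTA record, i.e. the header plus ALL of its sequence lines,
# whose ID (the first word after ">") is listed in IDS_FILE.
# ASSUMPTION: IDS_FILE has one contig name per line, without the ">".
extract_contigs() {
    local ids_file=$1 fasta=$2
    awk '
        # First file: remember the wanted contig names.
        NR == FNR { if ($1 != "") wanted[$1] = 1; next }
        # Header line: decide whether this record should be kept.
        /^>/      { id = substr($1, 2); keep = (id in wanted) }
        # Print the header and every following sequence line while keep is set.
        keep
    ' "$ids_file" "$fasta"
}
```

Called as extract_contigs "$out_dir/contig.txt" "$input", it prints whole records regardless of how many lines each sequence spans, which is exactly what grep -A 1 cannot do.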

lisavader commented 2 years ago

Hi Varun, I had indeed not realised that this script doesn't work for contigs spanning multiple lines, thanks for letting me know! It has now been fixed (584750d9625a7f78457c301744e1d22982041daa)

varunshamanna commented 2 years ago

Thank you for the updates. I just wanted to ask what your thoughts are on running the tool for multiple samples and combining the results.

lisavader commented 2 years ago

Actually, in a previous version of the tool a directory was used as input, and plasmidEC would output the combined result for all files in this directory. However, I changed this to a one-file-in, one-file-out approach, so that the tool can be integrated more easily into pipelines (e.g. Snakemake) and users can run multiple jobs in parallel, using whatever job scheduler they like. That gives more possibilities for customisation, at the cost that you'll have to concatenate the results of multiple samples yourself.
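Concatenating the per-sample results yourself could look roughly like the sketch below. It is an assumption-heavy illustration: the function name combine_results is hypothetical, and it assumes each sample produced a CSV file with a single header line, named after the sample. The actual plasmidEC output file names and layout are not stated in this thread, so adapt the paths to whatever the tool writes.

```shell
#!/usr/bin/env bash
# combine_results OUT_FILE CSV...  (hypothetical helper, not part of plasmidEC)
# Concatenate per-sample CSV files into OUT_FILE, keeping the header row
# only once and prefixing every data row with the sample name.
# ASSUMPTION: every input is a CSV with exactly one header line, and the
# sample name is the file name without ".csv".
combine_results() {
    local out=$1; shift
    local first=1 f sample
    : > "$out"                      # start from an empty output file
    for f in "$@"; do
        sample=$(basename "$f" .csv)
        awk -v s="$sample" -v first="$first" '
            # Header: written once (from the first file), with a "sample" column added.
            NR == 1 { if (first) print "sample," $0; next }
            # Data rows: tag each row with the sample it came from.
            { print s "," $0 }
        ' "$f" >> "$out"
        first=0
    done
}
```

After running the tool once per sample (in a loop, via Snakemake, or through a job scheduler), something like combine_results combined.csv results/*.csv would then merge everything, where results/ is a placeholder directory.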