steineggerlab / conterminator

Detection of incorrectly labeled sequences across kingdoms
GNU General Public License v3.0

Some configurations of taxons in --kingdoms don't work #20

Open HeloiseMuller opened 1 year ago

HeloiseMuller commented 1 year ago

Hi,

I am trying to clean several de novo insect assemblies of common contamination sources: human, bacteria and the UniVec database. To do this, I concatenated all the FASTA files into one file called 'cat.fna' and made a species-level mapping file called 'ids_cat'.
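For reference, here is a minimal sketch of how the lines of such a mapping file could be generated. It assumes that the identifier is the first whitespace-delimited token of each FASTA header and that the mapping file is tab-separated (identifier, then taxid). The function name `mapping_lines` is hypothetical and is not part of conterminator.

```python
def mapping_lines(fasta_lines, taxid):
    """Yield one 'identifier<TAB>taxid' line per FASTA header.

    Assumption: the identifier is the first whitespace-delimited
    token after '>' and every sequence in the input shares one taxid.
    """
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        tokens = line[1:].split()
        if tokens:  # ignore empty headers
            yield f"{tokens[0]}\t{taxid}"


if __name__ == "__main__":
    # Toy input standing in for one assembly; 9606 is Homo sapiens.
    example = [">seq1 some description\n", "ACGT\n", ">seq2\n", "GGCC\n"]
    for entry in mapping_lines(example, 9606):
        print(entry)
```

Calling this once per input FASTA file with that species' taxid and concatenating the results would give a single mapping file for the concatenated FASTA.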

I then ran: conterminator dna cat.fna ids_cat conterminator.results tmp_conterminator --threads 20 --blacklist 10239 --kingdoms '2,28384,9606,50557'

I changed the --kingdoms option in order to look for contamination between bacteria (2), other sequences (28384, the taxid I used for the UniVec sequences), Homo sapiens (9606) and insects (50557). I do not wish to look for contamination between my insect genomes. I do not need to ignore any taxa, so I specified only 10239 in the --blacklist option in order to replace the default blacklisted taxa (which include 28384, which I need).

Running this command, I get the following error message: rescorediagonal step died. Interestingly, it works if I specify only 2,28384,9606 or 2,50557 or even 9606,50557 for --kingdoms. Do you have any idea why the combination I used does not work? 28384,50557 does not work either, but it gives a different error message: Extractframes died

Moreover, I do not understand why the output of the run with 2,50557 reports contamination between bacteria and human. In this configuration, shouldn't it skip looking for contamination between these two taxa entirely?

nohup_conterminator.txt

Thanks,

Héloïse

N.B. Just to let you know, it seems that conterminator cannot deal with some patterns of FASTA identifier. The UniVec sequences have identifiers like gnl|uv|X66730.1:1-2687-49. I had to change that to gnl uv|X66730.1:1-2687-49 and write gnl 28384 in the mapping file. Otherwise I got the error: crosstaxonfilterorf step died.
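The header change described in this note can be sketched as follows. This is only an illustration of the workaround, assuming that every UniVec header starts with ">gnl|uv|"; the function name `rewrite_univec_header` is hypothetical and is not part of conterminator.

```python
def rewrite_univec_header(line):
    """Replace the first '|' of a UniVec header with a space.

    '>gnl|uv|X66730.1:1-2687-49' becomes '>gnl uv|X66730.1:1-2687-49',
    so the identifier (first whitespace-delimited token) is 'gnl'.
    Sequence lines and other headers are returned unchanged.
    """
    if line.startswith(">gnl|uv|"):
        return line.replace("|", " ", 1)
    return line


if __name__ == "__main__":
    # Toy records; the second header is a made-up non-UniVec example.
    records = [
        ">gnl|uv|X66730.1:1-2687-49\n",
        "ACGTACGT\n",
        ">contig_1\n",
        "TTGGCCAA\n",
    ]
    for line in records:
        print(rewrite_univec_header(line), end="")
```

After this rewrite, all UniVec sequences share the identifier gnl, which is why a single mapping line (gnl 28384) covers them.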