josuebarrera / GenEra

genEra is a fast and easy-to-use command-line tool that estimates the age of the last common ancestor of protein-coding gene families.
GNU General Public License v3.0
46 stars 6 forks source link

Bug-fix for ';' seperated tax-ids gereated by diamond when using the NR database. #22

Closed Eulium closed 9 months ago

Eulium commented 9 months ago

Porpossale to fix bug in Erassignmet possible resulting in missing lineage information

BUG encountered:

When using NCBI's NR database with Diamond the output file can contain rows with multiple tax ids. These are separated by a ';' (see below, last column). In line 50 of the Erassignment script the diamond output is piped into multiple awk and sed commands which extract the tax id column and use's its entries as key. In rows with multiple tax id's the ';' separated values will be used as a key in total, resulting in no matching rows from the phylostrata file. Any tax id that is only present in rows with multiple tax id's will not be used in later procedures of Erassignment and may cause wrong or incomplete results as phylostrat linage information matching tax id's found only in rows with ';' would not be present in the tmp_phylostrata file.

Excerpt from a Diamond output file which will reproduce the bug:

NP_001268798.1 XP_042691240.1 4.05e-141 402 9002;30410;64668 NP_001268798.1 XP_037265533.1 1.20e-133 384 8954;120794;345155;345164 NP_001268798.1 XP_029871740.1 3.43e-133 382 8957;202280;223781 NP_001268798.1 XP_057257371.1 5.69e-132 379 12930;35540 NP_001268798.1 XP_029871741.1 1.46e-131 378 8957;202280;223781´

Buggy code

Line 50 awk -v GENE="$GENE" '{ if ($1 == GENE) print $NF }' ${TMP_PATH}/tmp_${GENE}.bout | sort -u -T ${TMP_PATH} | sed '/^[[:space:]]*$/d' | awk -F "," 'FNR==NR{ a[$1]=$0;next } ($1 in a)' - ${NCBITAX} | grep "$OLDEST" > %{TMP_PATH}/tmp_phylostrata

Proposed change

Add sed 's/;/\n/g command into the pipeline to split any ';' separated tax id's into multiple rows. This will result in them being split into single lines and properly used as single keys in the following awk command Line 50 awk -v GENE="$GENE" '{ if ($1 == GENE) print $NF }' ${TMP_PATH}/tmp_${GENE}.bout | sed 's/;/\n/g' | sort -u -T ${TMP_PATH} | sed '/^[[:space:]]*$/d' | awk -F "," 'FNR==NR{ a[$1]=$0;next } ($1 in a)' - ${NCBITAX} | grep "$OLDEST" > %{TMP_PATH}/tmp_phylostrata

josuebarrera commented 9 months ago

Looks very straightforward, thanks for adding this solution!