miwipe / ngsLCA

GNU General Public License v3.0
9 stars 5 forks source link

Taxa rank otu tables #24

Closed valentinavan closed 2 years ago

valentinavan commented 2 years ago

Hi,

Is it possible to combine each taxa rank otutable into one rather than having them separate (one for species, one for genera etc)?

I can see that the program creates files containing all the ranks put together (kraken style), like "complete profile" file or the one in the taxa_groups folder but the counts, in my case, do not match with the counts in the separate files, therefore I am confused.

Thanks

miwipe commented 2 years ago

Hi, Thank you for this questions @valentinavan. In the ngsLCA R package you will find the option ngsLCA_group this groups all reads per taxonomic level classified below a user-defined taxonomic level e.g., kingdoms (Bacteria, Archaea or Eukaryota) or it could also just be plants (Viridiplantae) or like. Could this be what you are looking for?

I just reran the part of the test data and I do get matching counts. For example here I compare the complete_profile output with the taxa_groups/Viridiplantae.txt file.

complete_profile.txt:238069 Saliceae tribe 16 7 10 complete_profile.txt:3688 Salicaceae family 106 101 102 complete_profile.txt:40685 Salix genus 42 38 26 complete_profile.txt:75712 Salix interior species 3 1 9

taxa_groups/Viridiplantae.txt:238069 Saliceae tribe 16 7 10 taxa_groups/Viridiplantae.txt:3688 Salicaceae family 106 101 102 taxa_groups/Viridiplantae.txt:40685 Salix genus 42 38 26 taxa_groups/Viridiplantae.txt:75712 Salix interior species 3 1 9

The number in both table are reads assigned to the specific taxonomic level. In the Taxonomic_ranks folders, the counts are cumulative as you when you go higher taxonomically.

Could you provide me with an example of where you find conflicting counts? I hope this helps? Mikkel

valentinavan commented 2 years ago

Hi Mikkel,

Thanks for replying. Yes I know about this option, I have used it. Complete_profile and taxa_groups/Viridiplantae.txt not always match in my case. For example, the phyla match but at the genus level, complete_profile has 1 genus more and I did only specify the plants using this option:

ngsLCA_group(path=working_directory,
             run=run_name,
             group.name="Viridiplantae", 
             threshold.perGroup=0)

but more importantly if I look at the single .txt file for each Viridiplantae rank (taxa_ranks folder) there is a big difference in number of taxa. For example I have: 5 genera in taxa_groups/Viridiplantae.txt 6 genera in complete_profile.txt 1762 genera in taxa_ranks/Viridiplantae_genus.txt 1982 genera in taxa_ranks/all_taxa_genus.txt

Which txt file should I use for my downstream analyses? And what am I doing wrong for having these big mismatches?

Thanks

miwipe commented 2 years ago

Hi again,

That looks really weird @valentinavan, how are you counting the number of genera? In the numbers I give you I use a subset of the test files: SPL_015_1444.fq.plastids.bam SPL_055_3112.fq.plastids.bam SPL_113_5430.fq.plastids.bam

That can be downloaded here https://sid.erda.dk/sharelink/AqTXI6E560, they are small files, and can be run even on smaller computers.

In most cases we would expect the numbers to decrease at higher taxonomic levels, but there can be cases where reads are only classified at genus level and does not go to species level, hence you can have additional counts at genus. See example here:

 281 complete_profile.txt
  81 taxa_ranks/Viridiplantae_family.txt
 108 taxa_ranks/Viridiplantae_genus.txt
 105 taxa_ranks/Viridiplantae_species.txt
 118 taxa_ranks/all_taxa_family.txt
 157 taxa_ranks/all_taxa_genus.txt
 159 taxa_ranks/all_taxa_species.txt

Would it be possible for you to process the same 3 files in order to be able to compare numbers? Mikkel

miwipe commented 2 years ago

@valentinavan alternatively, you can also send me 1 or 2 of your files and I can try and replicate the error? And see what the issue is.

valentinavan commented 2 years ago

Hi Mikkel,

Thanks for replying, I have not had time to try with the SPL files. Will do it soon and I will also send you an email with some of my files so you can try too. Bye

miwipe commented 2 years ago

Excellent, just the .lca files will be enough for now.

miwipe commented 2 years ago

@valentinavan for the table you are skaing for, you can also parse your lca output with this pipe below, it gives you a semicolon seperated output with: Number_reads;Taxonomic_name;Taxonomic_level;Sample_name.

That should give you what you want in a long table format that you can plot with ggplot2 in R. This prints it to screen

for file in *lca 
do
myvar=$(basename $file | cut -f1 -d.)
grep 'Viridiplantae' $file | cut -f11,12,13 -d: | sort  | uniq -c | sed -e 's/^ *//;s/ /;/'| sed 's/:/\t/g' | awk -F"\t" -v myvar=$(basename $file | cut -f1 -d.) 'BEGIN {OFS = ";"} { print $1, $2, myvar }' | sort -k1 -n
done

This saves it to a file (mydata.txt)

for file in *lca 
do
myvar=$(basename $file | cut -f1 -d.)
grep 'Viridiplantae' $file | cut -f11,12,13 -d: | sort  | uniq -c | sed -e 's/^ *//;s/ /;/'| sed 's/:/\t/g' | awk -F"\t" -v myvar=$(basename $file | cut -f1 -d.) 'BEGIN {OFS = ";"} { print $1, $2, myvar }' | sort -k1 -n > mydata.txt
done

And you can do this for any kingdom/phylum or level you want, by replacing Viridiplantae.