sr320 / paper-pano-go

Draft manuscript describing Panopea gonad transcriptome

Data difference #14

Closed mdelrio1 closed 8 years ago

mdelrio1 commented 8 years ago

@sr320 I have a question. a) In the 02-Geoduck-protein.ipynb notebook, the deduced proteins from Transdecoder give 35951 possible proteins (am I interpreting that correctly?). b) In the manuscript's annotation results you wrote that there are "23,165 (15%) annotated sequences". c) I obtained 19652 annotated sequences from Geoduck-transcriptome-V2-GO-slim.csv. I thought that the 1862 sequences in Geoduck_v2_blastn-NT_out.csv had been subtracted from Geoduck-transcriptome-V2-GO-slim.csv, but together they add up to 21514 (19652 + 1862), not the 23165 from point b). What am I doing wrong?

There are only 44 sequences in Geoduck_v2_blastn-NT_out.csv with an E-value of 0. What do we do with the other sequences? Thanks

sr320 commented 8 years ago

b) The 23,165 refers to the number of transcriptome contigs that had matches, not proteins (though we should add those).

c) No sequences have been subtracted. We can subtract post-annotation (now). Relatedly, not all 23,165 will necessarily have GOslim information.
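To illustrate why the GOslim count can be lower than the annotated count, here is a minimal sketch using set operations. The contig IDs and GOslim terms are made up, and the real column layout of Geoduck-transcriptome-V2-GO-slim.csv is not shown in this thread, so treat the data structures as assumptions.

```python
# Hypothetical data: IDs and terms are invented for illustration only.
annotated = {"contig_1", "contig_2", "contig_3", "contig_4", "contig_5"}
goslim_rows = [
    ("contig_1", "cell cycle"),
    ("contig_1", "transport"),  # one contig can carry several GOslim terms
    ("contig_3", "transport"),
]


def reconcile(annotated_ids, rows):
    """Split annotated contigs into those with and without a GOslim term."""
    annotated_ids = set(annotated_ids)
    # Count each contig once, even if it appears in several GOslim rows.
    with_goslim = {contig for contig, _term in rows} & annotated_ids
    without_goslim = annotated_ids - with_goslim
    return with_goslim, without_goslim


with_goslim, without_goslim = reconcile(annotated, goslim_rows)
print(len(annotated), len(with_goslim), len(without_goslim))
```

In this toy case the annotated total is the sum of the two groups, so a gap between the two files' counts does not by itself mean that sequences were subtracted.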

d) If we agree that only those 44 bacterial sequences should be removed, we probably will not do anything else with that file. We should keep it on hand and check back at the end to make sure we are not claiming a given contig is important for reproduction when it is in fact likely bacterial, with an E-value of 1e-100 etc.

It might actually be better / safer to remove all 405 sequences that match with bacteria?

There is also the option of doing a metagenomic analysis.
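The two removal policies discussed above (only the E-value 0 hits, or every contig matching bacteria) could be compared with a sketch like the following. The `Hit` record, its `is_bacterial` flag, and the example values are hypothetical; how bacterial hits are actually identified in Geoduck_v2_blastn-NT_out.csv is not described in this thread.

```python
from typing import Iterable, NamedTuple, Optional, Set


class Hit(NamedTuple):
    """Hypothetical summary of one blastn hit; not the real file layout."""
    contig: str
    is_bacterial: bool  # assumed to be derived from the hit's taxonomy
    evalue: float


def contigs_to_remove(hits: Iterable[Hit],
                      max_evalue: Optional[float] = None) -> Set[str]:
    """Return contigs with a bacterial hit at or below max_evalue.

    With max_evalue=None, every contig with any bacterial hit is returned.
    """
    return {
        h.contig
        for h in hits
        if h.is_bacterial and (max_evalue is None or h.evalue <= max_evalue)
    }


# Invented example hits.
hits = [
    Hit("contig_1", True, 0.0),
    Hit("contig_2", True, 1e-100),
    Hit("contig_3", False, 0.0),
    Hit("contig_4", True, 1e-5),
]

strict = contigs_to_remove(hits, max_evalue=0.0)  # only E-value 0 bacterial hits
broad = contigs_to_remove(hits)                   # all bacterial hits
kept_after_broad = {h.contig for h in hits} - broad
print(sorted(strict), sorted(broad), sorted(kept_after_broad))
```

Keeping the removed set around, rather than deleting rows in place, makes it easy to check at the end whether any contig highlighted in the results was among the bacterial matches.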

mdelrio1 commented 8 years ago

OK, thanks. I'll make the changes in the Results section.