Closed sr320 closed 8 years ago
blastn is complete
table @ https://github.com/sr320/paper-pano-go/blob/master/jupyter-nbs/analyses/Geoduck_v2_blastn-NT.out
Now we need to decide which contigs should be classified as likely coming from bacteria associated with the gonadal tissue.
What would be the criteria to do it?
That is a great question. I would say that anything that hits "Bacteria" at an e-value of '0' should be considered likely bacteria, and not Geoduck. I am open to suggestions on what else we should consider.
What are people's thoughts when looking at the taxa hits @ https://github.com/sr320/paper-pano-go/blob/master/jupyter-nbs/analyses/Geoduck_v2_blastn-NT.out
Once we agree how stringent to be, we could do a join to pull out those seqs from further analysis.
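A minimal sketch of what such a join could look like, assuming the blast table has been reduced to three tab-delimited columns (query ID, e-value, kingdom). That layout, the demo rows, and the function names `flag_contigs` and `drop_flagged` are hypothetical; the real `.out` file has more columns, so the column indices would need adjusting.

```python
import csv
import io

# Hypothetical demo rows: query ID, e-value, kingdom (not taken from the real file).
DEMO_HITS = (
    "contig_a\t0.0\tBacteria\n"
    "contig_b\t3e-50\tEukaryota\n"
    "contig_c\t8e-21\tBacteria\n"
    "contig_d\t0.0\tEukaryota\n"
)

def flag_contigs(handle, kingdom="Bacteria", max_evalue=0.0,
                 id_col=0, evalue_col=1, kingdom_col=2):
    """Return the query IDs whose hit is in `kingdom` with e-value <= max_evalue."""
    flagged = set()
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if row[kingdom_col] == kingdom and float(row[evalue_col]) <= max_evalue:
            flagged.add(row[id_col])
    return flagged

def drop_flagged(contig_ids, flagged):
    """Anti-join: keep only the contigs that were not flagged."""
    return [cid for cid in contig_ids if cid not in flagged]

all_ids = ["contig_a", "contig_b", "contig_c", "contig_d"]
strict = flag_contigs(io.StringIO(DEMO_HITS))                   # e-value of 0 only
lenient = flag_contigs(io.StringIO(DEMO_HITS), max_evalue=1.0)  # every Bacteria hit
print(sorted(strict), sorted(lenient))
print(drop_flagged(all_ids, strict))
```

Keeping the threshold as a parameter makes it easy to compare a strict cut (e-value of 0) against a lenient one before settling on a criterion.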
@sr320 @lafarga13 In this file there are 405 (out of 1862, i.e. 22%) with a max e-value of 8E-21. <img src="https://github.com/mdelrio1/mdelrio-panopea1/blob/master/img/Geoduck_v2_blastn-NT.out.png" width="50%"/>
The frequency distribution of Eukaryota/Bacteria is as follows. <img src="https://github.com/mdelrio1/mdelrio-panopea1/blob/master/img/EukaryotaBacteria.png" width="50%"/>
As you can see, there are 48 sequences with an e-value of zero, although I'm not sure how to use this information.
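For reference, a rough sketch of how such a tally could be produced in Python rather than from a plot. The `(kingdom, e-value)` pairs below are made-up demo values and `summarize` is a hypothetical helper; only the 405 and 1862 figures come from this thread.

```python
from collections import Counter

# Hypothetical (kingdom, e-value) pairs; not values from the real table.
demo_hits = [
    ("Bacteria", 0.0),
    ("Bacteria", 8e-21),
    ("Eukaryota", 0.0),
    ("Eukaryota", 1e-88),
    ("Eukaryota", 2e-30),
]

def summarize(hits):
    """Count hits per kingdom, hits with e-value 0 per kingdom, and percent share."""
    per_kingdom = Counter(kingdom for kingdom, _ in hits)
    zero_evalue = Counter(kingdom for kingdom, evalue in hits if evalue == 0.0)
    total = sum(per_kingdom.values())
    share = {k: round(100 * n / total, 1) for k, n in per_kingdom.items()}
    return per_kingdom, zero_evalue, share

per_kingdom, zero_evalue, share = summarize(demo_hits)
print(per_kingdom, zero_evalue, share)

# The proportion quoted above: 405 out of 1862.
print(round(100 * 405 / 1862, 1))  # 21.8, i.e. roughly 22%
```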
I am thinking now maybe we remove all 405?
Also I notice there are another 20 that are likely not Geoduck
```
fgrep "N/A" analyses/Geoduck_v2_blastn-NT.out
comp28250_c0_seq1 gi 514055706 gb KC802228.1 100.00 181 0 0 18 198 216 36 1e-88 335 N/A Synthetic construct breast cancer binding peptide PC82 gene, partial cds 32630 synthetic construct synthetic construct other sequences
comp61908_c0_seq1 gi 389297270 gb JQ794641.1 96.40 222 8 0 1 222 240 19 4e-98 366 N/A Uncultured Petrobacter sp. clone OTU-17 16S ribosomal RNA gene, partial sequence 463796 N/A N/A N/A
comp95185_c0_seq1 gi 539360076 gb KC989926.1 100.00 343 0 0 1 343 1500 1842 1e-178 634 N/A Cloning vector pSTn5-KM, complete sequence 1389801 Cloning vector pSTn5-KM Cloning vector pSTn5-KM other sequences
comp95185_c1_seq1 gi 499074117 gb KC577243.1 100.00 226 0 0 1 226 3832 4057 1e-113 418 N/A Cloning vector pR6KT-miniTn7T-P1eGFP-FK, complete sequence 1332675 Cloning vector pR6KT-miniTn7T-P1eGFP-FK Cloning vector pR6KT-miniTn7T-P1eGFP-FK other sequences
comp95185_c1_seq2 gi 18150422 gb AF409199.1 99.05 105 1 0 180 284 4121 4017 1e-44 189 N/A Shuttle vector pCE320, partial sequence 183765 Shuttle vector pCE320 Shuttle vector pCE320 other sequences
comp95185_c2_seq1 gi 459360454 gb KC200570.1 100.00 268 0 0 1 268 234 501 6e-137 496 N/A Binary vector pYBA-300, complete sequence 1301036 Binary vector pYBA-300 Binary vector pYBA-300 other sequences
comp98229_c0_seq1 gi 259116154 gb GQ874257.1 95.79 214 9 0 1 214 790 1003 5e-92 346 N/A Uncultured organism clone 1041059766404 genomic sequence 155900 uncultured organism uncultured organism N/A
comp101927_c3_seq1 gi 195934828 gb BC168400.1 80.35 173 33 1 194 366 974 803 1e-26 130 N/A Synthetic construct Mus musculus clone IMAGE:100068369, MGC:195913 tau tubulin kinase 2 (Ttbk2) mRNA, encodes complete protein 32630 synthetic construct synthetic construct other sequences
comp104423_c0_seq1 gi 539360076 gb KC989926.1 99.85 664 1 0 1 664 2305 2968 0.0 1221 N/A Cloning vector pSTn5-KM, complete sequence 1389801 Cloning vector pSTn5-KM Cloning vector pSTn5-KM other sequences
comp115679_c0_seq2 gi 259116154 gb GQ874257.1 94.87 234 8 2 1 234 1349 1120 5e-97 363 N/A Uncultured organism clone 1041059766404 genomic sequence 155900 uncultured organism uncultured organism N/A
comp115969_c0_seq1 gi 512388800 emb HG315104.1 100.00 471 0 0 1 471 1264 1734 0.0 870 N/A Streptococcus sp. DSM 27088 partial 23S rRNA gene, strain DSM 27089, isolate 7746 1345497 N/A N/A N/A
comp126119_c0_seq1 gi 371881539 emb FQ727577.1 94.69 245 11 2 1 244 445 202 5e-102 379 N/A 16S rRNA amplicon fragment from a soil sample (ferralsol, Madagascar) resulting from a 16 days laboratory incubation experiment in the presence of 13C-enriched wheat-straw : Light-DNA fraction (DNA-SIP technique) 32644 unidentified unidentified N/A
comp135476_c0_seq5 gi 254048722 gb GQ233872.1 89.88 257 26 0 1 257 803 547 2e-87 331 N/A Uncultured marine organism clone IOBCBE001_08-A08-SP6.ab1 genomic sequence 360281 uncultured marine organism uncultured marine organism N/A
comp137358_c0_seq11 gi 364588385 gb JN436381.1 89.50 1209 109 13 65 1269 127 1321 0.0 1513 N/A Uncultured organism clone SBXZ_5221 16S ribosomal RNA gene, partial sequence 155900 uncultured organism uncultured organism N/A
comp138387_c0_seq4 gi 168151307 emb CU674602.1 78.25 308 65 2 1 307 322 16 1e-46 196 N/A Synthetic construct Homo sapiens gateway clone IMAGE:100018300 5' read TUBB2A mRNA 32630 synthetic construct synthetic construct other sequences
comp138387_c0_seq6 gi 168151367 emb CU674662.1 78.99 714 150 0 1 714 730 17 3e-134 488 N/A Synthetic construct Homo sapiens gateway clone IMAGE:100018301 5' read TUBB2B mRNA 32630 synthetic construct synthetic construct other sequences
comp141713_c0_seq1 gi 312152669 gb HQ448367.1 75.84 592 131 6 1356 1944 663 81 2e-74 291 N/A Synthetic construct Homo sapiens clone IMAGE:100071791; CCSB003826_02 polymerase (RNA) II (DNA directed) polypeptide E, 25kDa (POLR2E) gene, encodes complete protein 32630 synthetic construct synthetic construct other sequences
comp142037_c1_seq1 gi 117645665 emb AM393421.1 72.23 1019 246 31 1623 2624 1592 2590 1e-70 279 N/A Synthetic construct Homo sapiens clone IMAGE:100001729 for hypothetical protein (CYFIP2 gene) 32630 synthetic construct synthetic construct other sequences
comp144044_c1_seq9 gi 293651473 dbj AB553833.1 91.59 416 35 0 1 416 11771 12186 1e-160 575 N/A Human artificial chromosome vector 21HAC4 DNA, isolated from the short arm, clone: YAC/BAC#37-2 751903 Human artificial chromosome vector 21HAC4 Human artificial chromosome vector 21HAC4 other sequences
comp153429_c0_seq1 gi 29825358 gb AY238516.1 99.19 246 2 0 1 246 678 433 2e-121 444 N/A Synthetic construct triacylglycerol lipase gene, complete cds 32630 synthetic construct synthetic construct other sequences
```
@sr320 Yes, I noticed them. Did you remove the "N/A" rows from the table with `fgrep`? Has the Geoduck_v2_blastn-NT.out file been updated so that I can download it? Should we rename it?
@mdelrio1 I took the transcriptome and removed the Bacteria and N/A sequences - here
I created a new fasta file called Geoduck-transcriptome-v3.fa
Because it is >100 MB, only the zipped version is in the repo.
It is located at ../data-results/
We can also remove these seqs post analysis.
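For anyone wanting to reproduce the removal step, here is a minimal sketch of filtering a FASTA file by a set of contig IDs. It is not necessarily how the v3 file was built; the demo records and the `filter_fasta` name are hypothetical, and the record ID is assumed to be the first word after `>`.

```python
import io

def filter_fasta(lines, exclude_ids):
    """Yield FASTA lines, skipping records whose ID (first word after '>') is excluded."""
    keep = True
    for line in lines:
        if line.startswith(">"):
            header = line[1:].split()
            record_id = header[0] if header else ""
            keep = record_id not in exclude_ids
        if keep:
            yield line

# Hypothetical demo records; contig_b stands in for a flagged bacterial contig.
demo_fasta = io.StringIO(
    ">contig_a len=8\nACGTACGT\n"
    ">contig_b len=8\nGGCC\nTTAA\n"
    ">contig_c len=4\nAAAA\n"
)
kept = list(filter_fasta(demo_fasta, {"contig_b"}))
print("".join(kept), end="")
```

Because the function streams line by line, it should cope with files over 100 MB without loading them fully into memory; the same call works on an open file handle in place of the `io.StringIO` demo.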
@sr320 Ok, I'm working on it.
@sr320 Hello, I have carried out the analysis. The fasta file did not have the annotations, so I extracted the sequence names and other information from the fasta file and created a small database in Excel (sorry, I need more time to work out how to handle a database in IPython), and then selected the sequences that were present in this database (no bacteria, no N/A). Then I built a pivot table and obtained the figure that is now in the manuscript (I left the first one in, so you can see both). Furthermore, since we're looking for reproduction genes, I extracted from "other biological processes" the genes that were related to reproduction and separated them. You'll see 0.05% reproduction, which is equivalent to nine sequences. However, when working with the whole database there are 109 genes related to reproduction. I think I can get them out, but I have some doubts.
GO slim | sequences |
---|---|
RNA metabolism | 2378 |
transport | 2242 |
protein metabolism | 2053 |
developmental processes | 1295 |
signal transduction | 1092 |
stress response | 729 |
cell organization and biogenesis | 721 |
cell cycle and proliferation | 606 |
DNA metabolism | 575 |
cell adhesion | 455 |
death | 308 |
cell-cell signaling | 43 |
reproduction | 9 |
other biological processes | 3327 |
other metabolic processes | 3650 |
These add up to 19483 (see the formatted table https://github.com/mdelrio1/mdelrio-panopea1/blob/master/notes/TableGeoduck.md).
However, when bacteria and N/A are included there are 19652 unique sequences, so the difference is 169 and not the 405 you removed.
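The arithmetic can be checked quickly. The counts below are copied from the table above; the variable names are just for illustration.

```python
# Sequence counts per GO slim category, copied from the table above.
go_slim_counts = {
    "RNA metabolism": 2378,
    "transport": 2242,
    "protein metabolism": 2053,
    "developmental processes": 1295,
    "signal transduction": 1092,
    "stress response": 729,
    "cell organization and biogenesis": 721,
    "cell cycle and proliferation": 606,
    "DNA metabolism": 575,
    "cell adhesion": 455,
    "death": 308,
    "cell-cell signaling": 43,
    "reproduction": 9,
    "other biological processes": 3327,
    "other metabolic processes": 3650,
}

total = sum(go_slim_counts.values())
difference = 19652 - total  # 19652 = unique sequences with bacteria and N/A included
repro_pct = round(100 * go_slim_counts["reproduction"] / total, 2)

print(total)       # 19483
print(difference)  # 169
print(repro_pct)   # 0.05
```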
So, in summary: data from Geoduck-transcriptome-V3-GO.fa were 'annotated' using Geoduck-transcriptome-V2-GO-GOslim. A graph was built using data from a pivot table. Reproduction annotations were extracted from "other biological processes" so that they appear as a separate item.
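As a possible IPython alternative to the Excel pivot table, here is a rough sketch that counts unique sequences per GO slim term and moves reproduction-related entries out of "other biological processes". The row layout, the demo rows, the keyword list, and the `pivot` function are all assumptions for illustration, not the actual workflow used.

```python
from collections import defaultdict

# Hypothetical rows: (sequence ID, GO slim term, GO term description).
rows = [
    ("contig_a", "transport", "ion transport"),
    ("contig_a", "transport", "protein transport"),
    ("contig_b", "other biological processes", "gamete generation"),
    ("contig_c", "other biological processes", "circadian rhythm"),
    ("contig_d", "RNA metabolism", "mRNA processing"),
]

# Assumed keyword list for recognising reproduction-related descriptions.
REPRO_KEYWORDS = ("reproduction", "gamete", "spermatogenesis", "oogenesis")

def pivot(rows, excluded_ids=frozenset()):
    """Count unique sequence IDs per GO slim term, skipping excluded sequences."""
    seqs_per_term = defaultdict(set)
    for seq_id, slim, description in rows:
        if seq_id in excluded_ids:
            continue  # e.g. contigs flagged as bacteria or N/A
        is_repro = any(k in description.lower() for k in REPRO_KEYWORDS)
        if slim == "other biological processes" and is_repro:
            slim = "reproduction"
        seqs_per_term[slim].add(seq_id)
    return {term: len(ids) for term, ids in seqs_per_term.items()}

table = pivot(rows)
table_filtered = pivot(rows, excluded_ids={"contig_d"})
print(table)
print(table_filtered)
```

Counting sets of IDs rather than rows means a contig with several annotations under the same term is only counted once, which is what a "unique sequences" pivot needs.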
Sounds like we have addressed the occurrence of non euk seqs in different ways (which is great as it provides checks).
I would be interested to know how you were able to do this in Excel.
Relatedly, searching ../data-results/Geoduck-transcriptome-v2-GO-Slim.csv
for "reproduction" returns 222 hits.
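A small sketch of how such a search could report both matching rows and distinct sequence IDs, since the two can differ when a contig appears on several rows. Whether that explains the different reproduction counts in this thread is not established here. The two-column layout, the demo rows, and `count_matches` are hypothetical, and the match is case-insensitive, unlike a plain `fgrep`.

```python
import csv
import io

# Hypothetical demo rows: sequence ID, GO slim term.
demo_csv = (
    "contig_a,reproduction\n"
    "contig_a,reproduction\n"
    "contig_b,transport\n"
    "contig_c,Reproduction\n"
)

def count_matches(handle, needle="reproduction", id_col=0):
    """Return (matching rows, distinct IDs among those rows), ignoring case."""
    matching_rows = 0
    ids = set()
    for row in csv.reader(handle):
        if any(needle in field.lower() for field in row):
            matching_rows += 1
            ids.add(row[id_col])
    return matching_rows, len(ids)

n_rows, n_ids = count_matches(io.StringIO(demo_csv))
print(n_rows, n_ids)  # 3 2
```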
As I believe we have addressed the non-euk issue, I will go ahead and close this issue.
Started the analysis, see: http://nbviewer.ipython.org/urls/dl.dropbox.com/s/ylm22nfatkaglp8/Crazy-blast-Geoduck-v2-NT.ipynb