sr320 / paper-pano-go

Draft manuscript describing Panopea gonad transcriptome

Need to check transcriptome for non-Geoduck sequences #9

Closed: sr320 closed this issue 8 years ago

sr320 commented 8 years ago

Started the analysis; see: http://nbviewer.ipython.org/urls/dl.dropbox.com/s/ylm22nfatkaglp8/Crazy-blast-Geoduck-v2-NT.ipynb

sr320 commented 8 years ago

blastn is complete

table @ https://github.com/sr320/paper-pano-go/blob/master/jupyter-nbs/analyses/Geoduck_v2_blastn-NT.out

We now need to decide which contigs should be classified as likely coming from bacteria associated with the gonadal tissue.

lafarga13 commented 8 years ago

What would be the criteria to do it?

sr320 commented 8 years ago

That is a great question. I would say that anything that hits "Bacteria" with an e-value of 0 should be considered likely bacterial, and not geoduck. I am open to suggestions on what else we should consider.

What are people's thoughts when looking at the taxa hits @ https://github.com/sr320/paper-pano-go/blob/master/jupyter-nbs/analyses/Geoduck_v2_blastn-NT.out

Once we agree on how stringent to be, we could do a join to exclude those seqs from further analysis.
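A minimal sketch of the criterion proposed above (a "Bacteria" hit with an e-value of 0), written as a plain Python filter over the tab-separated BLAST table. The function name `bacterial_contigs` and the default column positions are assumptions for illustration; the real column layout of Geoduck_v2_blastn-NT.out would need to be checked before use.

```python
import io

def bacterial_contigs(handle, evalue_col=14, kingdom_col=16, max_evalue=0.0):
    """Return query IDs that have a Bacteria hit with e-value <= max_evalue.

    evalue_col and kingdom_col are 0-based positions in the tab-separated
    table; the defaults are assumptions and must be checked against the file.
    """
    flagged = set()
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(evalue_col, kingdom_col):
            continue  # skip short or malformed rows
        try:
            evalue = float(fields[evalue_col])
        except ValueError:
            continue  # e-value column did not hold a number
        if fields[kingdom_col] == "Bacteria" and evalue <= max_evalue:
            flagged.add(fields[0])
    return flagged

# Hypothetical three-column rows (query ID, e-value, kingdom) for illustration.
demo = io.StringIO(
    "contigA\t0.0\tBacteria\n"
    "contigB\t1e-50\tBacteria\n"
    "contigC\t0.0\tEukaryota\n"
)
drop_ids = bacterial_contigs(demo, evalue_col=1, kingdom_col=2)
print(sorted(drop_ids))
```

Passing a larger `max_evalue` relaxes the cutoff, so the same function can be rerun once a stringency is agreed on.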

mdelrio1 commented 8 years ago

@sr320 @lafarga13 From this file, there are 405 (out of 1862 = 22%) with a maximum e-value of 8E-21. <img src="https://github.com/mdelrio1/mdelrio-panopea1/blob/master/img/Geoduck_v2_blastn-NT.out.png" width="50%"/>

The frequency distribution of Eukaryota/Bacteria is as follows: <img src="https://github.com/mdelrio1/mdelrio-panopea1/blob/master/img/EukaryotaBacteria.png" width="50%"/>

As you can see, there are 48 sequences with an e-value of zero, although I'm not sure how to use this information.

sr320 commented 8 years ago

I am now thinking that maybe we should remove all 405?

sr320 commented 8 years ago

Also, I notice there are another 20 that are likely not geoduck:

fgrep "N/A" analyses/Geoduck_v2_blastn-NT.out

comp28250_c0_seq1   gi  514055706   gb  KC802228.1      100.00  181 0   0   18  198 216 36  1e-88     335   N/A Synthetic construct breast cancer binding peptide PC82 gene, partial cds    32630   synthetic construct synthetic construct other sequences
comp61908_c0_seq1   gi  389297270   gb  JQ794641.1      96.40   222 8   0   1   222 240 19  4e-98     366   N/A Uncultured Petrobacter sp. clone OTU-17 16S ribosomal RNA gene, partial sequence    463796  N/A N/A N/A
comp95185_c0_seq1   gi  539360076   gb  KC989926.1      100.00  343 0   0   1   343 1500    1842    1e-178    634   N/A Cloning vector pSTn5-KM, complete sequence  1389801 Cloning vector pSTn5-KM Cloning vector pSTn5-KM other sequences
comp95185_c1_seq1   gi  499074117   gb  KC577243.1      100.00  226 0   0   1   226 3832    4057    1e-113    418   N/A Cloning vector pR6KT-miniTn7T-P1eGFP-FK, complete sequence  1332675 Cloning vector pR6KT-miniTn7T-P1eGFP-FK Cloning vector pR6KT-miniTn7T-P1eGFP-FK other sequences
comp95185_c1_seq2   gi  18150422    gb  AF409199.1      99.05   105 1   0   180 284 4121    4017    1e-44     189   N/A Shuttle vector pCE320, partial sequence 183765  Shuttle vector pCE320   Shuttle vector pCE320   other sequences
comp95185_c2_seq1   gi  459360454   gb  KC200570.1      100.00  268 0   0   1   268 234 501 6e-137    496   N/A Binary vector pYBA-300, complete sequence   1301036 Binary vector pYBA-300  Binary vector pYBA-300  other sequences
comp98229_c0_seq1   gi  259116154   gb  GQ874257.1      95.79   214 9   0   1   214 790 1003    5e-92     346   N/A Uncultured organism clone 1041059766404 genomic sequence    155900  uncultured organism uncultured organism N/A
comp101927_c3_seq1  gi  195934828   gb  BC168400.1      80.35   173 33  1   194 366 974 803 1e-26     130   N/A Synthetic construct Mus musculus clone IMAGE:100068369, MGC:195913 tau tubulin kinase 2 (Ttbk2) mRNA, encodes complete protein  32630   synthetic construct synthetic construct other sequences
comp104423_c0_seq1  gi  539360076   gb  KC989926.1      99.85   664 1   0   1   664 2305    2968    0.0  1221   N/A Cloning vector pSTn5-KM, complete sequence  1389801 Cloning vector pSTn5-KM Cloning vector pSTn5-KM other sequences
comp115679_c0_seq2  gi  259116154   gb  GQ874257.1      94.87   234 8   2   1   234 1349    1120    5e-97     363   N/A Uncultured organism clone 1041059766404 genomic sequence    155900  uncultured organism uncultured organism N/A
comp115969_c0_seq1  gi  512388800   emb HG315104.1      100.00  471 0   0   1   471 1264    1734    0.0   870   N/A Streptococcus sp. DSM 27088 partial 23S rRNA gene, strain DSM 27089, isolate 7746   1345497 N/A N/A N/A
comp126119_c0_seq1  gi  371881539   emb FQ727577.1      94.69   245 11  2   1   244 445 202 5e-102    379   N/A 16S rRNA amplicon fragment from a soil sample (ferralsol, Madagascar) resulting from a 16 days laboratory incubation experiment in the presence of 13C-enriched wheat-straw : Light-DNA fraction (DNA-SIP technique)    32644   unidentified    unidentified    N/A
comp135476_c0_seq5  gi  254048722   gb  GQ233872.1      89.88   257 26  0   1   257 803 547 2e-87     331   N/A Uncultured marine organism clone IOBCBE001_08-A08-SP6.ab1 genomic sequence  360281  uncultured marine organism  uncultured marine organism  N/A
comp137358_c0_seq11 gi  364588385   gb  JN436381.1      89.50   1209    109 13  65  1269    127 1321    0.0  1513   N/A Uncultured organism clone SBXZ_5221 16S ribosomal RNA gene, partial sequence    155900  uncultured organism uncultured organism N/A
comp138387_c0_seq4  gi  168151307   emb CU674602.1      78.25   308 65  2   1   307 322 16  1e-46     196   N/A Synthetic construct Homo sapiens gateway clone IMAGE:100018300 5' read TUBB2A mRNA  32630   synthetic construct synthetic construct other sequences
comp138387_c0_seq6  gi  168151367   emb CU674662.1      78.99   714 150 0   1   714 730 17  3e-134    488   N/A Synthetic construct Homo sapiens gateway clone IMAGE:100018301 5' read TUBB2B mRNA  32630   synthetic construct synthetic construct other sequences
comp141713_c0_seq1  gi  312152669   gb  HQ448367.1      75.84   592 131 6   1356    1944    663 81  2e-74     291   N/A Synthetic construct Homo sapiens clone IMAGE:100071791; CCSB003826_02 polymerase (RNA) II (DNA directed) polypeptide E, 25kDa (POLR2E) gene, encodes complete protein   32630   synthetic construct synthetic construct other sequences
comp142037_c1_seq1  gi  117645665   emb AM393421.1      72.23   1019    246 31  1623    2624    1592    2590    1e-70     279   N/A Synthetic construct Homo sapiens clone IMAGE:100001729 for hypothetical protein (CYFIP2 gene)   32630   synthetic construct synthetic construct other sequences
comp144044_c1_seq9  gi  293651473   dbj AB553833.1      91.59   416 35  0   1   416 11771   12186   1e-160    575   N/A Human artificial chromosome vector 21HAC4 DNA, isolated from the short arm, clone: YAC/BAC#37-2 751903  Human artificial chromosome vector 21HAC4   Human artificial chromosome vector 21HAC4   other sequences
comp153429_c0_seq1  gi  29825358    gb  AY238516.1      99.19   246 2   0   1   246 678 433 2e-121    444   N/A Synthetic construct triacylglycerol lipase gene, complete cds   32630   synthetic construct synthetic construct other sequences

mdelrio1 commented 8 years ago

@sr320 Yes, I noticed them. Did you eliminate the "N/A" entries from the database with the "fgrep"? Has the Geoduck_v2_blastn-NT.out file been updated so that I can download it? Should we rename it?

sr320 commented 8 years ago

@mdelrio1 I took the transcriptome and removed the Bacteria and N/A sequences - here

I created a new FASTA file called Geoduck-transcriptome-v3.fa. Because it is >100 MB, only the zipped version is in the repo.

It is located at ../data-results/

We can also remove these seqs post analysis.
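For reference, removing flagged sequences from a FASTA file can be sketched in plain Python as below. This is an illustration only, not necessarily how Geoduck-transcriptome-v3.fa was produced; the function name `filter_fasta` and the demo records are hypothetical.

```python
import io

def filter_fasta(in_handle, out_handle, drop_ids):
    """Copy FASTA records to out_handle, skipping IDs in drop_ids.

    Returns the number of records written.
    """
    keep = False
    written = 0
    for line in in_handle:
        if line.startswith(">"):
            # The record ID is the header text up to the first whitespace.
            header = line[1:].split()
            seq_id = header[0] if header else ""
            keep = seq_id not in drop_ids
            if keep:
                written += 1
        if keep:
            out_handle.write(line)
    return written

# Hypothetical records for illustration only.
source = io.StringIO(
    ">contigA len=4\nACGT\n>contigB len=6\nTTGA\nCC\n>contigC\nGGCC\n"
)
cleaned = io.StringIO()
n_kept = filter_fasta(source, cleaned, drop_ids={"contigB"})
print(n_kept)
print(cleaned.getvalue(), end="")
```

The same ID set could instead be applied to downstream result tables, which is the "post analysis" route mentioned above.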

mdelrio1 commented 8 years ago

@sr320 Ok, I'm working on it.

mdelrio1 commented 8 years ago

@sr320 Hello, I have carried out the analysis. The FASTA file did not have the annotations, so I extracted the sequence names and other information from it and created a small database in Excel (sorry, I need more time to work out how to handle a database in IPython), then selected the sequences that were present in this database (no bacteria, no N/A). Then I built a pivot table and obtained the figure that is in the manuscript (I left the first one so you can see both). Furthermore, since we're looking for reproduction genes, I extracted the genes related to reproduction from "other biological processes" and separated them. You'll see 0.05% reproduction, which is equivalent to nine sequences. However, when working with the whole database there are 109 genes related to reproduction. I think I can get them out, but I have some doubts:

| GO slim | Sequences |
| --- | --- |
| RNA metabolism | 2378 |
| transport | 2242 |
| protein metabolism | 2053 |
| developmental processes | 1295 |
| signal transduction | 1092 |
| stress response | 729 |
| cell organization and biogenesis | 721 |
| cell cycle and proliferation | 606 |
| DNA metabolism | 575 |
| cell adhesion | 455 |
| death | 308 |
| cell-cell signaling | 43 |
| reproduction | 9 |
| other biological processes | 3327 |
| other metabolic processes | 3650 |

which adds up to 19483 (see the formatted table: https://github.com/mdelrio1/mdelrio-panopea1/blob/master/notes/TableGeoduck.md)

But when bacteria and N/A are considered there are 19652 unique sequences, so there is a difference of 169, not the 405 you removed.
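The arithmetic in this comment can be rechecked with a few lines of Python. The counts are copied from the GO slim table above; the variable names are only illustrative.

```python
# GO slim counts as reported in the comment above.
go_slim_counts = {
    "RNA metabolism": 2378,
    "transport": 2242,
    "protein metabolism": 2053,
    "developmental processes": 1295,
    "signal transduction": 1092,
    "stress response": 729,
    "cell organization and biogenesis": 721,
    "cell cycle and proliferation": 606,
    "DNA metabolism": 575,
    "cell adhesion": 455,
    "death": 308,
    "cell-cell signaling": 43,
    "reproduction": 9,
    "other biological processes": 3327,
    "other metabolic processes": 3650,
}

total = sum(go_slim_counts.values())
unique_with_bacteria_and_na = 19652  # figure quoted in the comment
difference = unique_with_bacteria_and_na - total
reproduction_pct = 100 * go_slim_counts["reproduction"] / total

print(total)                       # 19483
print(difference)                  # 169
print(round(reproduction_pct, 2))  # 0.05
```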

So, in summary: data from Geoduck-transcriptome-V3-GO.fa were 'annotated' using Geoduck-transcriptome-V2-GO-GOslim. A graph was built using data from a pivot table. Reproduction annotations were extracted from "other biological processes" so that they appear as a separate item.

sr320 commented 8 years ago

Sounds like we have addressed the occurrence of non-euk seqs in different ways (which is great, as it provides checks).

I would be interested to know how you were able to do this in Excel.

On a related note, searching ../data-results/Geoduck-transcriptome-v2-GO-Slim.csv for "reproduction" results in 222 hits.
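A text search counts matching rows, which is not necessarily the same as the number of distinct sequences, so it may help to report both when comparing against the counts earlier in this thread. A rough sketch, assuming (unverified) that the first CSV column holds the sequence ID; the function name `count_term` and the demo rows are made up for illustration.

```python
import csv
import io

def count_term(handle, term, id_col=0):
    """Return (matching_rows, unique_ids) for rows with term in any field."""
    rows = 0
    ids = set()
    for record in csv.reader(handle):
        if any(term in field for field in record):
            rows += 1
            if len(record) > id_col:
                ids.add(record[id_col])
    return rows, len(ids)

# Made-up rows: sequence ID, annotation, GO slim category.
demo = io.StringIO(
    "contigA,gamete generation,reproduction\n"
    "contigA,fertilization,reproduction\n"
    "contigB,ion transport,transport\n"
    "contigC,gamete generation,reproduction\n"
)
rows, unique_ids = count_term(demo, "reproduction")
print(rows, unique_ids)
```

With the demo rows, three rows match but only two distinct IDs are involved.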

As I believe we have addressed the non-euk issue, I will go ahead and close this issue.