sr320 / paper-pano-go

Draft manuscript describing Panopea gonad transcriptome

comparison Dheilly seqs with Geo #16

Closed mdelrio1 closed 7 years ago

mdelrio1 commented 8 years ago

@sr320 I used the files

Dheilly_blastx_Geoduck.out

Geo-pep_tblastn_Dheilly.out

and counted how many of the sequences in each file also appear in the other.

For instance, blasting Dheilly's data against the Geo database gives a total of 3871 sequences (is my interpretation of the blast results correct?), while blasting the Geo database against Dheilly's gives 2799 matches. What I wanted to know is how many of the sequences in each dataset also appear in the other. For example (after sorting the databases), the sequence AB066348.p.cg.6 (Dheillys to Geo) is also present in Geo to Dheillys, while AB289857.p.cg.6 is present only in Dheillys to Geo, and cds.comp112432_c1_seq1 is present only in Geo to Dheillys. The results are as follows:

Dheillys to Geo

| sequence | Total |
| --- | --- |
| present in both | 3520 |
| blasted but not present in Geo | 351 |
| Grand Total | 3871 |

Geo to Dheillys

| sequence | Total |
| --- | --- |
| present in both | 2720 |
| blasted but not present in Dheillys | 79 |
| Grand Total | 2799 |
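The counting described above can be sketched in Python. This is only an illustration under stated assumptions, not the procedure actually used in this thread: it assumes both `.out` files are tab-separated BLAST tables with the query ID in column 1 and the subject ID in column 2 (the thread does not state the output format), and it reads "present in both" as "this query ID also appears as a subject ID in the other file". The function names `read_pairs` and `reciprocal_counts` are made up for the example, as are the toy hit pairings and the IDs `cds.comp1_c0_seq1` and `cds.comp2_c0_seq1`.

```python
import csv
import io


def read_pairs(handle):
    """Return (query_id, subject_id) pairs from tab-separated BLAST output.

    Assumption: column 1 is the query ID and column 2 is the subject ID.
    """
    pairs = []
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and comment lines.
        if len(row) >= 2 and not row[0].startswith("#"):
            pairs.append((row[0], row[1]))
    return pairs


def reciprocal_counts(forward, reverse):
    """Count forward queries that do or do not appear as subjects in reverse."""
    queries = {query for query, _ in forward}
    reverse_subjects = {subject for _, subject in reverse}
    in_both = queries & reverse_subjects
    return {
        "present in both": len(in_both),
        "blasted but not present": len(queries - in_both),
        "total": len(queries),
    }


# Toy data only: the pairings below are invented, not taken from the real files.
d_to_g = read_pairs(io.StringIO(
    "AB066348.p.cg.6\tcds.comp1_c0_seq1\n"
    "AB289857.p.cg.6\tcds.comp2_c0_seq1\n"
))
g_to_d = read_pairs(io.StringIO(
    "cds.comp1_c0_seq1\tAB066348.p.cg.6\n"
    "cds.comp112432_c1_seq1\tAB066348.p.cg.6\n"
))

print("Dheillys to Geo:", reciprocal_counts(d_to_g, g_to_d))
print("Geo to Dheillys:", reciprocal_counts(g_to_d, d_to_g))
```

With real files, `read_pairs(open("Dheilly_blastx_Geoduck.out"))` would replace the `io.StringIO` toy input, provided the files really are in that tabular layout. Because each direction counts unique query IDs, the two totals need not match: several queries can hit the same subject, and a hit in one direction does not guarantee a hit in the other.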

I'm still thinking about how to interpret these results, since I was hoping to get about the same number of blasted sequences in both directions rather than a difference of 800, and I don't understand why a sequence has a blast result in one dataset but does not appear in the other. Please let me know whether I should continue analyzing these data to obtain the GO terms and describe them according to GO Slim.

sr320 commented 8 years ago

Before we get too far down into the details, what do you think about how I tried to tackle this?

Here is the results section: https://github.com/sr320/paper-pano-go/blob/master/manuscript/results/02-annotation.md

Essentially I used blast followed by joining to find those genes comparable to what they determined to be sex-specific or different during gametogenesis.

the corresponding notebook

mdelrio1 commented 8 years ago

@sr320 I think you're at the end of the analysis I was trying to do. It's difficult ("just a little bit") to follow you when you compare the tables with sqlshare. The analysis you carried out on Fri Nov 13 06:42:03 PST 2015 (item [5]) was to join the Dheilly_blastn_Geoduck-v2 data from the blast (DtoG) and add the information from the Sig6_blastn_Sig9 database. But I don't know what is in this database (Sig6_blastn_Sig9). I'm going to study what you did; I think in the end it is what I wanted to do.

sr320 commented 8 years ago

Understand I spent days :hourglass: figuring this out previously so we could do the same thing for this paper- http://journal.frontiersin.org/article/10.3389/fphys.2014.00224/abstract

For the Dheilly paper they used Sigenae ID numbers (version 6). For our earlier work we needed to transfer CGI IDs and thus created a modified blast table to join. This same table could be used for the Geoduck work.

I just used this Sig6_blastn_Sig9 table to transfer info (join) from blasting everything in the Sigenae version 6 fasta to the sequence IDs they list in their table (see below). Note there would be several ways to handle the IDs, some more elegant. For example, there is likely some simple bash command that would modify the blast table to remove the suffix (.p.cg.6), but since I was joining in SQLshare, it was just as easy to do two joins...

screens
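As one possible alternative to the two joins, the suffix can be stripped before a single join. The sketch below is in Python rather than bash or SQLshare and is not what was run for the manuscript. It assumes the blast query IDs carry the `.p.cg.6` suffix while the published table lists the same IDs without it; the helper names, the subject IDs, and the annotation label are hypothetical placeholders.

```python
def strip_suffix(seq_id, suffix=".p.cg.6"):
    """Remove the trailing suffix from an ID, leaving other IDs untouched."""
    if seq_id.endswith(suffix):
        return seq_id[: -len(suffix)]
    return seq_id


def join_on_stripped_id(blast_rows, annotation):
    """Inner-join (query, subject) rows to a dict keyed by suffix-free IDs."""
    joined = []
    for query, subject in blast_rows:
        key = strip_suffix(query)
        if key in annotation:
            joined.append((query, subject, annotation[key]))
    return joined


# Toy rows: the subject IDs and the annotation text are invented placeholders.
blast_rows = [
    ("AB066348.p.cg.6", "cds.comp1_c0_seq1"),
    ("AB289857.p.cg.6", "cds.comp2_c0_seq1"),
]
annotation = {"AB066348": "placeholder cluster label"}

for row in join_on_stripped_id(blast_rows, annotation):
    print(row)
```

Rows whose stripped ID has no entry in the annotation table are dropped, which mirrors an inner join; a left join would keep them with an empty annotation instead.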

Hope this helps.

mdelrio1 commented 8 years ago

@sr320 Steve, I've added a figure (Fig. with Dheilly's (Cluster Des.)) from these data.

https://github.com/sr320/paper-pano-go/blob/master/manuscript/results/02-annotation.md

Do you think it is worth including? How were you thinking of presenting these data?