Annotation - Githubissues

mdelrio1 commented 9 years ago

Hi Steven, I'm trying to describe the annotation data, but I'm having some problems. I have checked the files a) Geoduck-transcriptome-v2.fasta and b) Geoduck-tranv2-blastx_sprot.tab, and I didn't find the GO data in the latter. Could you please tell me the column headers (I couldn't find a file with them, sorry) and how to obtain the GO information? Thanks

These are the first five rows of the Geoduck-tranv2-blastx_sprot.tab file.

comp95_c0_seq1 sp Q8K358 PIGU_MOUSE 67.53 77 25 0 231 1 258 334 7.00E-32 119
comp146_c0_seq1 sp P37137 LHX5_XENLA 79.31 58 12 0 175 2 4 61 4.00E-31 116
comp195_c0_seq1 sp Q8HXQ0 SODC_MACMU 54.84 62 28 0 188 3 80 141 1.00E-14 68.9
comp296_c0_seq1 sp P59966 DNAB_MYCBO 67.16 67 22 0 201 1 72 138 1.00E-11 63.2
comp434_c0_seq1 sp Q07954 LRP1_HUMAN 38.46 78 44 3 34 267 4094 4167 8.00E-10 59.3
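
The ten numeric columns in these rows match the default BLAST tabular output fields (`-outfmt 6`), assuming that is how the file was produced; the `sp`, accession and entry-name columns look like a subject ID such as `sp|Q8K358|PIGU_MOUSE` split into three. A minimal Python sketch under that assumption follows. The column labels are chosen here for illustration and are not headers shipped with the file.

```python
# Assumed column labels: BLAST default tabular fields, with the subject ID
# (e.g. sp|Q8K358|PIGU_MOUSE) already split into three columns.
COLUMNS = [
    "qseqid", "db", "accession", "entry_name",
    "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def parse_row(line):
    """Split one whitespace-delimited row into a dict keyed by COLUMNS."""
    fields = line.split()
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    return dict(zip(COLUMNS, fields))


# First row quoted in this thread.
row = parse_row(
    "comp95_c0_seq1 sp Q8K358 PIGU_MOUSE 67.53 77 25 0 231 1 258 334 7.00E-32 119"
)
print(row["accession"], row["pident"], row["evalue"])
```

Under this reading, the third column (the accession) is the one the SQL join on `Column3` would match against `SPID`.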

sr320 commented 9 years ago

Hi Miguel- You are correct, the GO information is not included in those files yet. The following code should get GO and GO slim information in SQLShare (sqlshare.escience.washington.edu).

SELECT *
FROM [sr320@washington.edu].[Geoduck-tranv2-blastx_sprot] blast
LEFT JOIN [sr320@washington.edu].[SPID and GO Numbers] go
  ON blast.Column3 = go.SPID
LEFT JOIN [sr320@washington.edu].[GO_to_GOslim] slim
  ON go.GOID = slim.GO_id
WHERE aspect LIKE 'P'

I am trying now on SQLShare to get the results- but it is currently still :runner:
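
To see what the query above does on a small scale, here is a rough Python sketch using an in-memory SQLite database. The table names are simplified stand-ins for the SQLShare datasets, and the GO rows (`GO:toy1`, `GO:toy2`, the `term` column) are made up for illustration. Only the join keys (`Column3`, `SPID`, `GOID`, `GO_id`, `aspect`) come from the query itself.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Toy stand-ins for the three SQLShare tables; names are simplified and
# the GO rows are made up for illustration only.
cur.execute("CREATE TABLE blast (Column1 TEXT, Column3 TEXT)")
cur.execute("CREATE TABLE go (SPID TEXT, GOID TEXT)")
cur.execute("CREATE TABLE slim (GO_id TEXT, term TEXT, aspect TEXT)")

cur.executemany("INSERT INTO blast VALUES (?, ?)", [
    ("comp95_c0_seq1", "Q8K358"),
    ("comp146_c0_seq1", "P37137"),   # no GO rows for this accession below
])
cur.executemany("INSERT INTO go VALUES (?, ?)", [
    ("Q8K358", "GO:toy1"),
    ("Q8K358", "GO:toy2"),
])
cur.executemany("INSERT INTO slim VALUES (?, ?, ?)", [
    ("GO:toy1", "toy process", "P"),
    ("GO:toy2", "toy function", "F"),
])

# Same shape as the query in this thread: two left joins, then keep
# only rows whose GO slim aspect is 'P' (biological process).
rows = cur.execute("""
    SELECT blast.Column1, blast.Column3, go.GOID, slim.term
    FROM blast
    LEFT JOIN go   ON blast.Column3 = go.SPID
    LEFT JOIN slim ON go.GOID = slim.GO_id
    WHERE slim.aspect LIKE 'P'
""").fetchall()
print(rows)
```

One thing the toy data makes visible: because the `WHERE` filter is on a column from a left-joined table, contigs with no biological-process term drop out of the result instead of appearing with empty GO columns.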

mdelrio1 commented 9 years ago

Thanks Steven, I think it´s best to wait or should I run the code?

sr320 commented 9 years ago

If you want to, go ahead and see if you can get results from your SQLShare account.

On Wed, Nov 4, 2015 at 3:29 PM Miguel del Rio notifications@github.com wrote:

Thanks Steven, I think it´s best to wait or should I run the code?


mdelrio1 commented 9 years ago

OK, I´ll run it. It's running

mdelrio1 commented 9 years ago

Hi Steven, it finished, but I couldn't download the database (Geoduck-tranv2-GO). I shared it with you; however, it seems something is wrong: it has 100 rows and 20 columns! Could you please tell me what I did wrong? I'm attaching the snapshot of the run. Thanks

sr320 commented 9 years ago

From my end it looks like you did it correctly (see screenshot): the preview shows 100 rows, but there are 100k+ records.

I think there is a problem on the server side. I am trying to download but it is just "waiting".

I will let it keep going and let you know if I can get a successful download.

mdelrio1 commented 9 years ago

Thanks Steven, I agree with you. In the screenshot I took there were 100 sequences, but now I just logged in and it says "Rows 1 - 100 of 102358", as in the screenshot you sent me. I still can't download the files, though. I'll wait too. Thanks

sr320 commented 9 years ago

Ok I think I finally got it... I ended up doing the two joins separately, then creating a "snapshot", before downloading...

The CSV file with GO and GO slim (BP only) information is now in the repo

https://raw.githubusercontent.com/sr320/paper-pano-go/master/data-results/Geoduck-transcriptome-v2-GO-Slim.csv

mdelrio1 commented 9 years ago

Thanks Steven, I managed to download the file this morning too, and I got the same results with both files. This is the image for the annotation. Please let me know whether you prefer a different setting. https://github.com/mdelrio1/mdelrio-panopea1/blob/master/img/Panopea_annotation.png

lafarga13 commented 9 years ago

I am confused, are you running the annotation again? Because I am currently using the female and male data to generate the CpG's of shared genes among them... Is that ok? As I understand it, you are using all the "panopea" data together for this annotation... right?

mdelrio1 commented 9 years ago

Steven, that image was made with all the annotated data; here is the no-duplicates graph. https://github.com/mdelrio1/mdelrio-panopea1/blob/master/img/Panopea_annotationNoduplicates.png

sr320 commented 9 years ago

@lafarga13 - go ahead and do the female and male data. That would be great. Once the analysis workflow is set (or we see something interesting with your analysis) we could do full transcriptome easily.

sr320 commented 9 years ago

@mdelrio1 Looks good! "GO" ahead and add text, figures to main paper repo. You have write access so you can edit directly just as if it was one of your repos.

mdelrio1 commented 9 years ago

@sr320 I'll add the fig and write something in the paper repo. Thanks

sr320 commented 9 years ago

@mdelrio1 do you have a table with unique GOslim information for each contig?

mdelrio1 commented 9 years ago

@sr320 Yes, I'll add it to data-results as an Excel file with two sheets, one with all the information and the second with unique GOslim, unless you say otherwise.
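
One way a "unique GOslim per contig" table could be built is by dropping repeated (contig, GOslim term) pairs, since the same contig can reach the same GOslim term through several GO IDs. The Python sketch below illustrates that idea only. The helper name, the two-column layout and the example terms are hypothetical, not the actual columns of the file discussed here.

```python
def unique_pairs(rows):
    """Return (contig, GOslim term) pairs with duplicates dropped,
    keeping first-seen order."""
    seen = set()
    out = []
    for contig, term in rows:
        key = (contig, term)
        if key not in seen:
            seen.add(key)
            out.append(key)
    return out


# Made-up rows for illustration; the contig names are from this thread,
# the terms are placeholders.
rows = [
    ("comp95_c0_seq1", "toy process A"),
    ("comp95_c0_seq1", "toy process A"),
    ("comp95_c0_seq1", "toy process B"),
    ("comp146_c0_seq1", "toy process A"),
]
print(unique_pairs(rows))
```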

mdelrio1 commented 9 years ago

@sr320 The file is in my repository https://github.com/mdelrio1/mdelrio-panopea1/blob/master/data/Geoduck-transcriptome-v2-GO-Slim.xlsx I couldn't upload it to the data folder.

sr320 commented 9 years ago

@mdelrio1 Can you save as a CSV, then add? If not I can try.

NOTE you should probably rename with 'unique' as there is already a file with this name which has all GOslim info.

mdelrio1 commented 9 years ago

@sr320 I have uploaded the .csv file and renamed it: https://github.com/mdelrio1/mdelrio-panopea1/blob/master/data/Geoduck-transcriptome-v2-GO-SlimUnique.csv It only has the unique GO results. Instead of adding, I tried to count the rows with

 !wc ../panopea_data/data-results/Geoduck-transcriptome-v2-GO-SlimUnique.csv

but I got

 0 81687 3320985 ../panopea_data/data-results/Geoduck-transcriptome-v2-GO-SlimUnique.csv

Zero rows? So in order to obtain the information I also tried

 !grep -c "comp" ../panopea_data/data-results/Geoduck-transcriptome-v2-GO-SlimUnique.csv

thinking that all rows have "comp" as part of the name, but it gave me 1. How do you count rows in the .csv files?

sr320 commented 9 years ago

I have added that file to this repo.

What you have experienced is just one of the side effects of using Excel :smile: . Saving as CSV in Excel uses non-Unix line breaks.

I opened the csv up in TextWrangler and re-saved.

Line breaks changed to Unix

and wc now indicates

 wc -l /Users/sr320/git-repos/paper-pano-go/data-results/Geoduck-transcriptome-v2-GO-SlimUnique.csv
   19652 /Users/sr320/git-repos/paper-pano-go/data-results/Geoduck-transcriptome-v2-GO-SlimUnique.csv

Again the location is now in data-results
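
`wc -l` counts newline (LF) characters, so a file whose rows end in bare carriage returns (CR) reports 0 lines, and `grep -c` sees it as a single line. That matches the `0` and `1` results above. If re-saving in a text editor is not convenient, the same conversion can be sketched in Python. The helper names and the sample bytes below are made up for illustration.

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert CRLF and bare CR line breaks to Unix LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def count_lines(data: bytes) -> int:
    """Count LF characters, which is what `wc -l` reports."""
    return data.count(b"\n")


# Toy CSV content with CR-only line breaks (made up for illustration).
raw = b"contig,term\rcomp95_c0_seq1,toy A\rcomp146_c0_seq1,toy B\r"

print(count_lines(raw))                      # no LF bytes before conversion
print(count_lines(normalize_newlines(raw)))  # one LF per row afterwards
```

For a real file, read it with `open(path, "rb")`, pass the bytes through `normalize_newlines`, and write the result back with `open(path, "wb")`.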

mdelrio1 commented 9 years ago

@sr320 Thanks I was going to say that there seemed to be only one line!! thanks again I´ll work with TextWrangler

sr320 / paper-pano-go

Annotation #10