Unfolding GO terms in a column w/ Galaxy

Ellior2 commented 6 years ago

I've got a file where I am trying to unfold a column with multiple GO terms delimited by semicolons. I have performed this function many times in Galaxy but for whatever reason I cannot get it to work for me right now.

Here is the file: https://github.com/RobertsLab/project-pacific.oyster-larvae/blob/master/DIA_2015/AnnotatedproteinsGO.tabular

It's driving me bonkers. Does anyone have an easy workaround or see something that might be my problem? I just tried redoing a file I had successfully done in the past and I got the same error.

This is what I am trying: [image] https://user-images.githubusercontent.com/20071030/28049813-9ea10586-65ae-11e7-8884-69e8455f0371.png

And this is what I get: [image] https://user-images.githubusercontent.com/20071030/28049841-cc04d548-65ae-11e7-8bb3-0b4955b87d64.png

I tried emailing them but have no idea what their turnaround is for responding: a day, a week, a month?

sr320 commented 6 years ago

Big picture, would it be correct to assume you are trying to get GO Slim terms for each protein?

Ellior2 commented 6 years ago

Eventually, yes. Is there a way to go directly to GO slim from the protein name? I just looked at UniProt and I just see:

I still want to have the GO terms unfolded for use in enrichment tools, but I could bypass this for now if there was a way to go directly to GO slim.

sr320 commented 6 years ago

I do not know an easy way of unfolding...

Here are a few possible solutions: https://unix.stackexchange.com/questions/184156/scripting-to-split-a-single-csv-row-into-multiple

The other way to go about it is to bypass the folded column by joining with a table that has multiple rows per UniProt entry.
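To make the join idea concrete, here is a minimal sketch using the standard `sort` and `join` utilities. It assumes both files are tab-delimited and share an identifier (for example the CGI code) in their first column, and that the second file already has one GO term per row. The function name `join_on_first_column` is made up for this example.

```shell
# Hypothetical helper: join two tab-delimited files on their first column.
#   $1: table with one row per protein (key in column 1)
#   $2: long-format table with one GO term per row (key in column 1)
join_on_first_column() {
    tab=$(printf '\t')
    sorted_go=$(mktemp)
    # join needs both inputs sorted on the key, using the same collation
    LC_ALL=C sort -t "$tab" -k1,1 "$2" > "$sorted_go"
    LC_ALL=C sort -t "$tab" -k1,1 "$1" | LC_ALL=C join -t "$tab" - "$sorted_go"
    rm -f "$sorted_go"
}
```

Each protein row comes out once per matching row in the second file, so no unfolding step is needed. Proteins with no match in the second file are dropped by default.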

Note you have a special case here with gigas: there is a set of tables with annotations that can be downloaded.

I will start a new issue on GO slim so that this issue can stay focused on unfolding...

kubu4 commented 6 years ago

Can you provide an example (just a subset of rows) of how you want your output file to look (maybe the output from one of your previous successful Galaxy unfolding commands)?

sr320 commented 6 years ago

Input

CGI_10017757    20  27  23  29  K1QXE8  K1QXE8_CRAGI    unreviewed  Actin, cytoplasmic  CGI_10017757    Crassostrea gigas (Pacific oyster) (Crassostrea angulata)   379 GO:0005524
CGI_10019835    7   39  17  37  K1R278  K1R278_CRAGI    unreviewed  Tubulin beta chain  CGI_10019835    Crassostrea gigas (Pacific oyster) (Crassostrea angulata)   446 GO:0003924; GO:0005200; GO:0005525; GO:0005737; GO:0005874; GO:0007017

Output... Too hard to show with iPad :)

Just imagine from Stack Exchange

kubu4 commented 6 years ago

Sorry, I meant the output file from her unfolding command.

sr320 commented 6 years ago

Minor edit above

But essentially the CGI_10019835 line would be present 6 times, with the only difference being a single GO number in the last cell.
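A rough way to see that output is the awk sketch below. It assumes tab-delimited input with the GO terms in the last column, separated by "; ". The function name `unfold_last_column` is made up here, and the demo row is a shortened version of the tubulin line above.

```shell
# Hypothetical helper: print one row per GO term found in the last column.
unfold_last_column() {
    awk -F'\t' -v OFS='\t' '{
        n = split($NF, terms, /; */)     # split the last field on semicolons
        if (n == 0) { print; next }      # empty last field: keep the row as-is
        for (i = 1; i <= n; i++) {
            $NF = terms[i]               # replace the last field with one term
            print
        }
    }' "$@"
}

# Shortened tubulin row with three GO terms: printed as three rows.
printf 'CGI_10019835\tTubulin beta chain\tGO:0003924; GO:0005200; GO:0005525\n' | unfold_last_column
```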

kubu4 commented 6 years ago

Sorry for confusion! My bad. I couldn't/didn't see the additional columns with the GO numbers when I opened in Excel.

kubu4 commented 6 years ago

The other way to go about it is to bypass the folded column by joining with a table that has multiple rows per UniProt entry.

I agree that this is probably the easiest approach (if it's possible).

Along these lines, @Ellior2, I'm looking back through your notebook to see if there's a previous step that could make things a bit easier. How did you join your "CGI codes and GO terms in Galaxy"? The two files that you indicate were joined don't seem to have a common field on which to perform a join.

However, if we just operate on the current file, I have a rough idea for a bash script AND I have a perl script that does, essentially, the opposite (takes lines with duplicate field information and combines all the info into a single line). I think there's a chance I could figure out how to do the reverse. But figuring out either of these things will take me a while...

sr320 commented 6 years ago

Yes that is possible - that file should be in SQLShare.

Though I would be interested to learn how to unfold in bash, etc. I've hit this a few times. Certainly not a high priority.

And I imagine that Stack Exchange thread has a good solution.

Ellior2 commented 6 years ago

I joined my skyline output file with this uniprot table with CGI codes and GO terms using the "Gene Names" (5th) column where you will see "CGI_#######"

I got a response from the Galaxy team

She is mistaken when she says that I got a successful execution of this tool after the failure. I had tried an unfold on a different file with a different delimiter, and while it didn't give me this error, it also didn't actually do anything to the file. No unfolding occurred.

So I am going to play around with the file some more and try to figure out why some of my rows have different numbers of columns... that's very strange. I would just think that they would be empty, not absent...
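One quick way to check for rows with differing column counts is to tally the number of tab-separated fields per line. This is only a sketch, assuming a tab-delimited file. The function name `count_fields` and the demo input are made up.

```shell
# Hypothetical helper: tally how many rows have each number of tab-separated fields.
# Output columns: number of fields, number of rows with that many fields.
count_fields() {
    awk -F'\t' '{ counts[NF]++ }
        END { for (n in counts) printf "%d\t%d\n", n, counts[n] }' "$@" | sort -n
}

# Made-up input: two rows with 3 fields and one row with 2 fields.
printf 'a\tb\tc\nd\te\nf\tg\th\n' | count_fields
```

Empty cells still count as fields as long as their tabs are present, so more than one line of output suggests that some rows are missing the tabs themselves.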

Ellior2 commented 6 years ago

There is clearly something fishy going on...

This is what the unfold function claims to be able to do:

I just made a simple test file and tried to run it. See right hand side...

unfolded test

I'll email her and keep you updated

Ellior2 commented 6 years ago

And here is a file where I have successfully done this using Galaxy...

https://github.com/RobertsLab/project-pacific.oyster-larvae/blob/master/DDA_2016/GO_slim/Allsamples_Goslimjoin.interval

sr320 commented 6 years ago

Generally speaking, it is always good to know what is underneath Galaxy and how to do it on your computer.

kubu4 commented 6 years ago

I joined my skyline output file with this uniprot table with CGI codes and GO terms using the "Gene Names" (5th) column where you will see "CGI_#######"

Ah, I see! I had to scroll way down to see entries that have the CGI code. Again, sorry for confusion!!

kubu4 commented 6 years ago

I put together a script that will break out each GO term associated with a CGI accession to individual lines. That means there will be multiple rows per CGI accession.

Here it is, if anyone wants to try it. I tested it and it looks OK to me.

Please keep in mind, I wrote it to specifically work on the problem file @Ellior2 linked in her initial post: AnnotatedproteinsGO.tabular

Knowing that, it might not be useful beyond this singular example (it's written to operate on Column 13, which is the column containing her GO terms in that file). However, if desired, the script can probably be adjusted to be more flexible (e.g. automatically identify the column containing the GO terms).

I haven't added comments yet, so it might be hard to follow, but here are the steps that need to be followed to get it to work:

Copy and paste the script below to a new text file (do not use a word processor, like Microsoft Word!).
Change the path of the original file and provide a path for a new file name (both are on the sed line).
Change the path on the file= line to the same path as your new file name.
Save the file as: unfold.sh
Open Terminal.
Change to the directory where you saved your script.
Run the script by typing: bash unfold.sh
The output file will be called unfolded.tab. It will be located in the same directory as your script.

#!/bin/bash

# Turn every "; " into a tab so that each GO term becomes its own field.
# $'...' lets bash expand \t itself, so this does not depend on GNU sed.
sed $'s/; /\t/g' /path/to/original/file > /path/to/new/filename

file="/path/to/new/filename"
while read -r line
    do
    # Number of tab-separated fields on this line
    max_field=$(echo "$line" | awk -F'\t' '{print NF}')
    # Columns 1-12 are repeated on every output row
    set_fields=$(echo "$line" | cut -f1-12)
    if (( max_field < 13 ))
        then
            # No GO terms on this line: print it unchanged
            printf "%s\n" "$line"
        else
            # Columns 13 onward now hold the GO terms, one per field
            goterms=$(echo "$line" | cut -f13-"$max_field")
            IFS=$'\t' read -r -a array <<<"$goterms"
                for goterm in "${array[@]}"
                    do printf "%s\t%s\n" "$set_fields" "$goterm"
                done
    fi
done < "$file" > unfolded.tab
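As one possible take on the "more flexible" version mentioned above, the sketch below lets awk find the GO column itself instead of hard-coding Column 13. It is not the script above, just an illustration. It assumes tab-delimited input, "; " between terms, and that the GO column is the first field in a row that starts with `GO:`. The function name `unfold_go` is made up.

```shell
# Hypothetical helper: find the first field starting with "GO:" on each row
# and print one row per GO term in that field. Rows without GO terms pass through.
unfold_go() {
    awk -F'\t' -v OFS='\t' '{
        col = 0
        for (i = 1; i <= NF; i++)
            if ($i ~ /^GO:[0-9]+/) { col = i; break }
        if (col == 0) { print; next }    # no GO terms on this row
        n = split($col, terms, /; */)
        for (j = 1; j <= n; j++) {
            $col = terms[j]              # keep every other column as it was
            print
        }
    }' "$@"
}
```

Usage would look like `unfold_go AnnotatedproteinsGO.tabular > unfolded.tab`. Because the column is detected per row, it also copes with rows that have fewer columns than others.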

Ellior2 commented 6 years ago

Wow, awesome directions @kubu4 . Worked like a charm!

Thank you so much, I really appreciate it.

sr320 / LabDocs

Unfolding GO terms in a column w/ Galaxy #654