Closed Ellior2 closed 6 years ago
Big picture, would it be correct to assume you are trying to get GO Slim terms for each protein? On Mon, Jul 10, 2017 at 8:55 PM Rhonda Elliott notifications@github.com wrote:
I've got a file where I am trying to unfold a column with multiple GO terms delimited by semicolons. I have performed this function many times in Galaxy but for whatever reason I cannot get it to work for me right now.
Here is the file https://github.com/RobertsLab/project-pacific.oyster-larvae/blob/master/DIA_2015/AnnotatedproteinsGO.tabular
It's driving me bonkers. Does anyone have an easy workaround or see something that might be my problem? I just tried redoing a file I had successfully done in the past and I got the same error.
This is what I am trying: [image: image] https://user-images.githubusercontent.com/20071030/28049813-9ea10586-65ae-11e7-8884-69e8455f0371.png
And this is what I get: [image: image] https://user-images.githubusercontent.com/20071030/28049841-cc04d548-65ae-11e7-8bb3-0b4955b87d64.png
I tried emailing them but have no idea what their turnaround is for responding: a day, a week, a month?
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/sr320/LabDocs/issues/654, or mute the thread https://github.com/notifications/unsubscribe-auth/AEPHt4HDMLgRUsaMNylQRNn52POfD_Sdks5sMvH8gaJpZM4OTsjd .
Eventually, yes. Is there a way to go directly to GO Slim from the protein name? I just looked at UniProt and I just see:
I still want to have the GO terms unfolded for use in enrichment tools, but I could bypass this for now if there was a way to go directly to GO Slim.
I do not know an easy way of unfolding...
Here are a few possible solutions: https://unix.stackexchange.com/questions/184156/scripting-to-split-a-single-csv-row-into-multiple
The other way to go about it is to bypass unfolding altogether by joining with a table that already has multiple rows per UniProt entry.
Note that you have a special case here with C. gigas: there is a set of tables with annotations that can be downloaded.
I will start a new issue on GO Slim so that this issue can stay focused on unfolding.
Can you provide an example (just a subset of rows) of how you want your output file to look (maybe the output from one of your previous successful Galaxy unfolding commands)?
Input
CGI_10017757 20 27 23 29 K1QXE8 K1QXE8_CRAGI unreviewed Actin, cytoplasmic CGI_10017757 Crassostrea gigas (Pacific oyster) (Crassostrea angulata) 379 GO:0005524
CGI_10019835 7 39 17 37 K1R278 K1R278_CRAGI unreviewed Tubulin beta chain CGI_10019835 Crassostrea gigas (Pacific oyster) (Crassostrea angulata) 446 GO:0003924; GO:0005200; GO:0005525; GO:0005737; GO:0005874; GO:0007017
Output... Too hard to show on an iPad :)
Just imagine from Stack Exchange
Sorry, I meant the output file from her unfolding command.
Minor edit above
But essentially, the CGI_10019835 line would be present 6 times, with the only difference being a single GO number in the last cell.
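To make the expected output concrete, here is a minimal sketch in bash/awk. It assumes a tab-delimited file whose last column holds the GO terms separated by "; ". The function name unfold_last_column and the shortened three-column rows are made up for illustration; they are not from the actual file.

```bash
#!/bin/bash
# Hypothetical helper: repeat each row once per GO term found in its last column.
# Assumes tab-delimited input with "; " between GO terms.
unfold_last_column() {
  awk 'BEGIN { FS = OFS = "\t" }
  {
    n = split($NF, terms, /; */)   # split the last column on ";" plus optional spaces
    if (n == 0) { print; next }    # empty last column: keep the row as is
    for (i = 1; i <= n; i++) {
      $NF = terms[i]               # overwrite the last column with a single term
      print
    }
  }'
}

# Shortened stand-ins for the two input rows above (most columns omitted).
printf 'CGI_10017757\tK1QXE8\tGO:0005524\n' | unfold_last_column
printf 'CGI_10019835\tK1R278\tGO:0003924; GO:0005200; GO:0005525\n' | unfold_last_column
```

The first row has one GO term, so it comes out unchanged; the second comes out once per GO term, with everything but the last cell repeated.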
Sorry for confusion! My bad. I couldn't/didn't see the additional columns with the GO numbers when I opened in Excel.
The other way to go about it is to bypass unfolding altogether by joining with a table that already has multiple rows per UniProt entry.
I agree that this is probably the easiest approach (if it's possible).
Along these lines, @Ellior2 , I'm looking back through your notebook to see if there's a previous step that could make things a bit easier. How did you join your "CGI codes and GO terms in Galaxy"? The two files that you indicate were joined don't seem to have a common field on which to perform a join.
However, if we just operate on the current file, I have a rough idea for a bash script AND I have a perl script that does, essentially, the opposite (it takes lines with duplicate field information and puts all the info into a single line). I think there's a chance I could figure out how to do the reverse. But figuring out either of these things will take me a while...
Yes that is possible - that file should be in SQLShare.
Though I would be interested to learn how to unfold in bash, etc. I have hit this a few times. Certainly not a high priority.
And I imagine that Stack Exchange thread has a good solution.
I joined my skyline output file with this uniprot table with CGI codes and GO terms using the "Gene Names" (5th) column where you will see "CGI_#######"
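For anyone wanting to do this kind of join outside Galaxy, here is a rough sketch in bash/awk. It assumes both files are tab-delimited and that the key column holds exactly one CGI code per row. The function name join_on_key, the file names, and the made-up sample rows are all hypothetical; only the use of column 5 as the key in the annotation table comes from the description above.

```bash
#!/bin/bash
# Hypothetical helper: join two tab-delimited files on a shared key.
# $1 = annotation table, $2 = its key column number
# $3 = data table,       $4 = its key column number
join_on_key() {
  awk -v akey="$2" -v dkey="$4" 'BEGIN { FS = OFS = "\t" }
    NR == FNR { annot[$akey] = $0; next }        # first file: index rows by key
    ($dkey in annot) { print $0, annot[$dkey] }  # second file: append the matching row
  ' "$1" "$3"
}

# Tiny made-up files standing in for the annotation table and the Skyline output.
tmp=$(mktemp -d)
printf 'K1QXE8\tK1QXE8_CRAGI\tunreviewed\tActin\tCGI_10017757\tGO:0005524\n' > "$tmp/uniprot.tab"
printf 'CGI_10017757\t20\t27\t23\t29\n' > "$tmp/skyline.tab"
join_on_key "$tmp/uniprot.tab" 5 "$tmp/skyline.tab" 1
rm -r "$tmp"
```

Rows in the data table with no match in the annotation table are dropped, and if a key appears more than once in the annotation table only the last row is kept, so this is only a starting point.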
I got a response from the Galaxy team
She is mistaken when she says that I got a successful execution of this tool after the failure. I had tried an unfold on a different file with a different delimiter and, while it didn't give me this error, it also didn't actually do anything to the file. No unfolding occurred.
So I am going to play around with the file some more and try to figure out why some of my rows have different numbers of columns... that's very strange. I would just think that they would be empty, not absent...
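One quick way to check for rows with differing column counts is to tally the number of tab-delimited fields per row. This is a minimal sketch; the function name count_fields is made up, and the three-row input is just a stand-in for the real file.

```bash
#!/bin/bash
# Hypothetical helper: tally how many rows have each tab-delimited field count.
# Output: one line per distinct field count, as "<fields> <rows>".
count_fields() {
  awk -F'\t' '{ counts[NF]++ } END { for (n in counts) print n, counts[n] }' | sort -n
}

# Stand-in input: two rows with three fields, one row with two.
printf 'a\tb\tc\na\tb\na\tb\tc\n' | count_fields
```

On the real file this would be run as count_fields < AnnotatedproteinsGO.tabular. More than one output line means the rows do not all have the same number of columns.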
There is clearly something fishy going on...
This is what the unfold function claims to be able to do:
I just made a simple test file and tried to run it. See right hand side...
I'll email her and keep you updated
And here is a file where I have successfully done this using Galaxy...
Generally speaking it is always good to know what is underneath Galaxy and how to do it on your computer.
I joined my skyline output file with this uniprot table with CGI codes and GO terms using the "Gene Names" (5th) column where you will see "CGI_#######"
Ah, I see! I had to scroll way down to see entries that have the CGI code. Again, sorry for confusion!!
I put together a script that will break out each GO term associated with a CGI accession to individual lines. That means there will be multiple rows per CGI accession.
Here it is, if anyone wants to try it. I tested it and it looks OK to me.
Please keep in mind, I wrote it to specifically work on the problem file @Ellior2 linked in her initial post: AnnotatedproteinsGO.tabular
Knowing that, it might not be useful beyond this singular example (it's written to operate on Column 13, which is the column containing her GO terms in that file). However, if desired, the script can probably be adjusted to be more flexible (e.g. automatically identify the column containing the GO terms).
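Here are the steps that need to be followed to get it to work:
1. Save the script below as unfold.sh.
2. Run it: bash unfold.sh
3. The output file is named unfolded.tab. It will be located in the same directory as your script.
#!/bin/bash
# Written for AnnotatedproteinsGO.tabular, where column 13 holds the GO terms.
# Replace both placeholder paths below before running.

# Turn the "; " between GO terms into tabs so each term becomes its own field.
# GNU sed interprets \t as a tab; other sed implementations may not.
sed 's/; /\t/g' /path/to/original/file > /path/to/new/filename
file="/path/to/new/filename"

while read -r line
do
  # Count the tab-delimited fields on this line.
  max_field=$(printf '%s\n' "$line" | awk -F'\t' '{print NF}')
  # Columns 1-12 are repeated on every output row.
  set_fields=$(printf '%s\n' "$line" | cut -f1-12)
  if (( max_field < 13 ))
  then
    # No GO terms: print the line unchanged.
    printf "%s\n" "$line"
  else
    # Fields 13 onward are the individual GO terms.
    goterms=$(printf '%s\n' "$line" | cut -f13-"$max_field")
    IFS=$'\t' read -r -a array <<<"$goterms"
    # Print one row per GO term.
    for element in "${!array[@]}"
    do
      printf "%s\t%s\n" "$set_fields" "${array[$element]}"
    done
  fi
done < "$file" > unfolded.tab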
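As a possible starting point for the more flexible version mentioned above, here is a sketch that looks for the GO column on each row instead of hard-coding column 13. It is not the script above. It assumes tab-delimited input and that a GO column looks like "GO:" followed by digits, optionally repeated with "; " in between. The function name unfold_go_column and the sample row are made up.

```bash
#!/bin/bash
# Hypothetical helper: unfold the first column on each row that looks like a GO term list.
unfold_go_column() {
  awk 'BEGIN { FS = OFS = "\t" }
  {
    col = 0
    # Find the first field made up only of GO terms separated by ";".
    for (i = 1; i <= NF; i++)
      if ($i ~ /^GO:[0-9]+(; *GO:[0-9]+)*$/) { col = i; break }
    if (col == 0) { print; next }      # no GO terms on this row: print unchanged
    n = split($col, terms, /; */)
    for (j = 1; j <= n; j++) {
      $col = terms[j]                  # keep every other column, swap in one term
      print
    }
  }'
}

# Made-up row with the GO terms in the middle column.
printf 'CGI_10019835\tGO:0003924; GO:0005200\t446\n' | unfold_go_column
```

Because the column is detected per row, any columns after the GO terms are kept in place, which the column-13 script does not attempt.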
Wow, awesome directions @kubu4 . Worked like a charm!
Thank you so much, I really appreciate it.