Closed willking2 closed 9 years ago
Those might all work. The first thing that comes to mind, though, is a join, either before or after you subset stress response. The join can be done in SQLShare or bash.
Technically, the subsetting can also be done in SQLShare, so to my mind that would be the most elegant route, since you generated the files there.
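For the bash route, a minimal sketch of such a join could look like the block below. It is only an illustration under stated assumptions: the helper name `join_on_first_col` is hypothetical, the SPID is assumed to be the first column of both files, and no field is assumed to contain a quoted comma. The real column layout of the two CSVs may differ.

```shell
# Hypothetical helper: inner-join two comma-separated files on their first column.
# ASSUMPTIONS: the join key (e.g. the SPID) is column 1 in both files,
# and no field contains a quoted comma.
join_on_first_col() {
    left_sorted=$(mktemp)
    right_sorted=$(mktemp)
    # join(1) requires both inputs to be sorted on the join field,
    # using the same collation as join itself.
    LC_ALL=C sort -t, -k1,1 "$1" > "$left_sorted"
    LC_ALL=C sort -t, -k1,1 "$2" > "$right_sorted"
    # Output: key, remaining fields of file 1, remaining fields of file 2.
    LC_ALL=C join -t, "$left_sorted" "$right_sorted"
    rm -f "$left_sorted" "$right_sorted"
}
```

If the SPID sits in a different column, `join` accepts `-1 N` and `-2 N` to pick the join field for each file, and the `sort` keys would need to change to match. A header row, if present, is sorted in with the data and would need to be handled separately.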
Steven Roberts faculty.washington.edu/sr320
On Thu, Feb 19, 2015 at 9:53 PM, Will King notifications@github.com wrote:
I have two files, one of contigs annotated with GO information ( Nlap_annotated_GO.csv https://github.com/willking2/fish546_W15/blob/master/nlap-ano/products/Nlap_annotated_GO.csv) and one of contigs annotated with protein names ( Nlap_annotated_proteinnames.csv https://github.com/willking2/fish546_W15/blob/master/nlap-ano/products/Nlap_annotated_proteinnames.csv). They both have SPIDs.
I've successfully made a subset of Nlap_annotated_GO.csv that contains only contigs related to stress response. I did this using awk:
$ awk -F"," '/[Ss]tress response/ {print $0}' Nlap_annotated_GO.csv > Nlap_annotated_GO_stress.csv
Now, I would like to make a subset of Nlap_annotated_proteinnames.csv that contains only contigs related to stress response. Unlike Nlap_annotated_GO.csv, however, Nlap_annotated_proteinnames.csv does not have a column with GOSlim bins, so I can't just awk it.
Three ideas:
- Some magic command similar to awk that matches a string from a different file
- Somehow match rows in Nlap_annotated_proteinnames.csv with rows in my other subsetted file (Nlap_annotated_GO_stress.csv) using their common SPIDs. This would be similar to the VLOOKUP function in Excel.
- Just merge my original Nlap_annotated_GO.csv and Nlap_annotated_proteinnames.csv in Excel based on SPID and then subset using awk. But I figure there might be a more elegant way to do it in Unix.
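The first two ideas can be sketched with awk's common two-file idiom: read the keys from one file into an array, then print only the rows of the second file whose key is in that array. This is a rough illustration, not a tested solution for these files. The helper name `filter_by_key` and the column numbers passed to it are assumptions, and fields are assumed not to contain quoted commas.

```shell
# Hypothetical helper: keep rows of TARGET whose key also appears in LOOKUP.
# usage: filter_by_key LOOKUP.csv LOOKUP_COL TARGET.csv TARGET_COL
# ASSUMPTIONS: plain comma-separated files, no quoted commas inside fields,
# and a non-empty LOOKUP file (the NR == FNR test relies on that).
filter_by_key() {
    awk -F',' -v kcol="$2" -v tcol="$4" '
        NR == FNR { keys[$kcol] = 1; next }   # first file: remember every key
        $tcol in keys                         # second file: print rows with a known key
    ' "$1" "$3"
}
```

For example, if the SPID happened to be column 2 in both files (an assumption to check against the real data), `filter_by_key Nlap_annotated_GO_stress.csv 2 Nlap_annotated_proteinnames.csv 2` would print the protein-name rows whose SPID appears in the stress subset. Unlike a join, this keeps each matching row exactly once, even when an SPID occurs several times in the lookup file.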
Thoughts? Thanks
@willking2 if you want to give me the URLs of the two files in SQLShare, I can show you what the code would look like.
Thanks, I'll use SQLShare. Let me try to figure it out first, and I'll post here if I run into trouble.