sr320 / course-fish546-2015

Bioinformatics for Environmental Sciences
https://github.com/sr320/fish546-2015/wiki

awk across files #53

Closed willking2 closed 9 years ago

willking2 commented 9 years ago

I have two files, one of contigs annotated with GO information (Nlap_annotated_GO.csv) and one of contigs annotated with protein names (Nlap_annotated_proteinnames.csv). They both have SPIDs.

I've successfully made a subset of Nlap_annotated_GO.csv that contains only contigs related to stress response. I did this using awk:

$ awk -F"," '/[Ss]tress response/ {print $0}' Nlap_annotated_GO.csv > Nlap_annotated_GO_stress.csv

Now, I would like to make a subset of Nlap_annotated_proteinnames.csv that contains only contigs related to stress response. Unlike Nlap_annotated_GO.csv, however, Nlap_annotated_proteinnames.csv does not have a column with GOSlim bins, so I can't just awk it.

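Three ideas:

  • Some magic command similar to awk that matches a string from a different file
  • Somehow match rows in Nlap_annotated_proteinnames.csv with rows in my other subsetted file (Nlap_annotated_GO_stress.csv) using their common SPIDs. This would be similar to the Vlookup function in Excel.
  • Just merge my original Nlap_annotated_GO.csv and Nlap_annotated_proteinnames.csv in Excel based on SPID and then subset using awk. But I figure there might be a more elegant way to do it in Unix.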

Thoughts? Thanks

sr320 commented 9 years ago

Those might all work, but the first thing that comes to mind is a join, either before or after you subset for stress response. The join can be done in SQLShare or bash.
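For reference, a minimal sketch of what the bash route could look like with `sort` and `join`. The sample rows, the file names `go.csv` and `proteinnames.csv`, and the column layout (contig, SPID, annotation, with the SPID in column 2 of both files) are all assumptions for illustration; the real files may be laid out differently, and plain comma splitting will break if a field contains embedded commas.

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d) && cd "$workdir"
export LC_ALL=C   # make sort and join agree on collation order

# Hypothetical stand-ins for the two annotation files.
printf '%s\n' \
  'contig_2,P22222,cell organization' \
  'contig_1,P11111,stress response' \
  'contig_3,P33333,Stress response' > go.csv
printf '%s\n' \
  'contig_1,P11111,Heat shock protein' \
  'contig_2,P22222,Actin' \
  'contig_3,P33333,Catalase' > proteinnames.csv

# join requires both inputs to be sorted on the join field (column 2 here).
sort -t',' -k2,2 go.csv > go.sorted.csv
sort -t',' -k2,2 proteinnames.csv > proteinnames.sorted.csv

# Join on column 2 of each file; -o selects contig, SPID, protein name, GO term.
join -t',' -1 2 -2 2 -o 1.1,1.2,2.3,1.3 \
  go.sorted.csv proteinnames.sorted.csv > joined.csv

# Subset after the join, reusing the pattern from the original awk command.
awk '/[Ss]tress response/' joined.csv > joined_stress.csv
cat joined_stress.csv
```

If a contig carries several GO terms, the join emits one output row per matching pair, so duplicates may need to be collapsed afterwards.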

Technically the subsetting can also be done in SQLShare, so in my mind that would be the most elegant option, since you generated the files there.

Steven Roberts faculty.washington.edu/sr320

On Thu, Feb 19, 2015 at 9:53 PM, Will King notifications@github.com wrote:

I have two files, one of contigs annotated with GO information (Nlap_annotated_GO.csv, https://github.com/willking2/fish546_W15/blob/master/nlap-ano/products/Nlap_annotated_GO.csv) and one of contigs annotated with protein names (Nlap_annotated_proteinnames.csv, https://github.com/willking2/fish546_W15/blob/master/nlap-ano/products/Nlap_annotated_proteinnames.csv). They both have SPIDs.

I've successfully made a subset of Nlap_annotated_GO.csv that contains only contigs related to stress response. I did this using awk:

$ awk -F"," '/[Ss]tress response/ {print $0}' Nlap_annotated_GO.csv > Nlap_annotated_GO_stress.csv

Now, I would like to make a subset of Nlap_annotated_proteinnames.csv that contains only contigs related to stress response. Unlike Nlap_annotated_GO.csv, however, Nlap_annotated_proteinnames.csv does not have a column with GOSlim bins, so I can't just awk it.

Three ideas:

  • Some magic command similar to awk that matches a string from a different file
  • Somehow match rows in Nlap_annotated_proteinnames.csv with rows in my other subsetted file (Nlap_annotated_GO_stress.csv) using their common SPIDs. This would be similar to the Vlookup function in Excel.
  • Just merge my original Nlap_annotated_GO.csv and Nlap_annotated_proteinnames.csv in Excel based on SPID and then subset using awk. But I figure there might be a more elegant way to do it in Unix.

Thoughts? Thanks
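The VLOOKUP-style idea above, matching rows by shared SPID, can be sketched with awk reading two files in one pass. This is only an illustration under stated assumptions: the sample rows and the short file names are made up, and the SPID is assumed to be in column 2 of both files, which may not match the real layout.

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d) && cd "$workdir"

# Hypothetical stand-in for Nlap_annotated_GO_stress.csv (SPID in column 2).
printf '%s\n' \
  'contig_1,P11111,stress response' \
  'contig_3,P33333,stress response' > GO_stress.csv

# Hypothetical stand-in for Nlap_annotated_proteinnames.csv (SPID in column 2).
printf '%s\n' \
  'contig_1,P11111,Heat shock protein' \
  'contig_2,P22222,Actin' \
  'contig_3,P33333,Catalase' > proteinnames.csv

# While reading the first file (NR==FNR), remember each SPID as an array key.
# While reading the second file, print only rows whose SPID was remembered.
awk -F',' 'NR==FNR { keep[$2]; next } $2 in keep' \
  GO_stress.csv proteinnames.csv > proteinnames_stress.csv

cat proteinnames_stress.csv
```

Neither file needs to be sorted for this approach, since the lookup goes through an in-memory array rather than a merge.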


sr320 commented 9 years ago

@willking2 if you want to give me the URLs of the two files in SQLShare, I can show you what the code would look like.

willking2 commented 9 years ago

Thanks, I'll use SQLShare. Let me try to figure it out first, and I'll post here if I run into trouble.