lgatto / pRolocdata

Data accompanying the pRoloc package

Add data from Orre et al 2019 #41

Closed lgatto closed 5 years ago

lgatto commented 5 years ago

SubCellBarCode: Proteome-wide Mapping of Protein Localization and Relocalization

Subcellular localization is a main determinant of protein function; however, a global view of cellular proteome organization remains relatively unexplored. We have developed a robust mass spectrometry-based analysis pipeline to generate a proteome-wide view of subcellular localization for proteins mapping to 12,418 individual genes across five cell lines. Based on more than 83,000 unique classifications and correlation profiling, we investigate the effect of alternative splicing and protein domains on localization, complex member co-localization, cell-type-specific localization, as well as protein relocalization after growth factor inhibition. Our analysis provides information about the cellular architecture and complexity of the spatial organization of the proteome; we show that the majority of proteins have a single main subcellular location, that alternative splicing rarely affects subcellular location, and that cell types are best distinguished by expression of proteins exposed to the surrounding environment. The resource is freely accessible via www.subcellbarcode.org.

https://www.cell.com/molecular-cell/fulltext/S1097-2765(18)31005-0

lgatto commented 5 years ago

Ping @ococrook - do you want to do this? If so, feel free to assign the issue to yourself.

ococrook commented 5 years ago

@lgatto The data are provided with gene names rather than UniProt IDs. Do you recommend a reliable tool to convert them? I imagine you can use biomart?

lgatto commented 5 years ago

Either biomart, or you can use UniProt directly at https://www.uniprot.org/uploadlists/

ococrook commented 5 years ago

What naming convention should we use for the datasets? We have Orre2019 following convention, but there are 9 different datasets, so the cell line should be in the name. Do we want A431Orre2019 or a431Orre2019 or Orre2019A431 etc.?

lgatto commented 5 years ago

I would prefer orre2019 followed by the cell line, every letter lowercase:

orre2019a431
orre2019mcf7
...
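
A minimal sketch of that naming rule in Python, for illustration only. The helper name `dataset_name` is hypothetical, and dropping non-alphanumeric characters (so that a spelling like "MCF-7" maps to the same name as "MCF7") is an assumption, not something decided in this thread.

```python
def dataset_name(cell_line, prefix="orre2019"):
    """Build a dataset name: prefix followed by the cell line, all lowercase."""
    # Assumption: keep only letters and digits, so "MCF-7" and "MCF7" agree
    cleaned = "".join(ch for ch in cell_line if ch.isalnum())
    return (prefix + cleaned).lower()


print(dataset_name("A431"))  # orre2019a431
print(dataset_name("MCF7"))  # orre2019mcf7
```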

ococrook commented 5 years ago

Thanks, Laurent. I have some cases where the gene name has NA as its UniProt ID - any preference for what to do in these cases?

lgatto commented 5 years ago

You could try to figure out what version of UniProt they used, which could possibly help.

Otherwise:

  1. check if the second gene name has an identifier;
  2. use the gene names as feature names, but add the UniProt identifiers that you have as a feature variable, in case they are needed.

Option 2 above is the fastest and is sensible.
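
The two options can be combined, roughly as in the Python sketch below. It is only an illustration under stated assumptions: `build_features` is a hypothetical helper, the `;` separator between alternative gene names is assumed, and the gene names and identifiers in the usage example are placeholders rather than real data.

```python
def build_features(gene_entries, gene_to_uniprot, sep=";"):
    """Key features by gene name and keep the UniProt ID as an annotation.

    gene_entries: entries such as "GENE_A" or "GENE_B;GENE_C"
    gene_to_uniprot: dict mapping a single gene name to a UniProt ID
    """
    features = {}
    for entry in gene_entries:
        # Option 1: try each gene name in the entry until one has an ID
        candidates = [g.strip() for g in entry.split(sep) if g.strip()]
        uniprot = next(
            (gene_to_uniprot[g] for g in candidates if gene_to_uniprot.get(g)),
            None,  # no identifier found for any gene name in this entry
        )
        # Option 2: the gene name stays the feature name either way,
        # and the identifier (possibly missing) is stored alongside it
        features[entry] = {"uniprot": uniprot}
    return features


# Placeholder names and IDs, not taken from the Orre et al. data
mapping = {"GENE_A": "ID_A", "GENE_C": "ID_C"}
features = build_features(["GENE_A", "GENE_B;GENE_C", "GENE_D"], mapping)
```

Here `GENE_B;GENE_C` picks up the identifier of its second gene name, and `GENE_D` keeps a missing identifier but is still a valid feature.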

ococrook commented 5 years ago

I'll probably aim for option 2. UniProt keeps crashing too, probably because I'm asking for too many proteins at a time. Thanks, Laurent!
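
If the request size is the problem, one possible workaround is to submit the gene names in smaller batches. A minimal Python sketch follows; the `batches` helper and the batch size of 500 are assumptions for illustration, not a documented UniProt limit.

```python
def batches(items, size=500):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Placeholder gene names; each chunk would be submitted as a separate query
genes = ["GENE_%d" % i for i in range(1200)]
chunk_sizes = [len(chunk) for chunk in batches(genes, size=500)]
print(chunk_sizes)  # [500, 500, 200]
```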