glygener / glygen-backend-gsa


Evaluate population of GSA with all CarbBank entries #58

Open ReneRanzinger opened 1 year ago

ReneRanzinger commented 1 year ago

Evaluate the possibility for populating GSA with CarbBank data. A CSV with CarbBank data can be found here. The column abbreviations are explained here.

The first step would be to extract the biological data from the BS column. Several kinds of information are encoded in that column:

There are no dictionary IDs for the values; all of them are free text. We will need to extract them and map them to the corresponding dictionaries (NCBI Taxonomy, Disease Ontology, UBERON, etc.).

Have a look and let me know how to proceed. If you want me to change the format (add or remove columns) or split the columns up, let me know.

kmartinez834 commented 1 year ago

@ReneRanzinger There are rows that contain multiple values, separated by line breaks. The affected columns (with data that can be ingested by GSA) are AM, BS, MT, PA, PM.

For example, the following entry starting at line 98 has two different biological sources listed:

"16519","","",""," Takeuchi M; Takasaki S; Miyazaki H; Kato T; Hoshi S; Kochibe N; Kobata A",""," (CN) Chinese hamster, (OT) CHO cells (CN) human, (OT) urine"," J Biol Chem (1988) 263: 3657-3663"," 04-01-1992",""," N-linked glycoprotein recombinant glycoprotein",""," 1-2 Neup5Ac per molecule, urinary HuEPO has .alpha.2.fwdawr.3 and .alpha.2.fwdawr.6 linkages, rHuEPO has .alpha.2.fwdawr.3 linkages",""," EPO, erythropoietin, human EPO, erythropoietin, human, recombinant"," Kleen A"," AN1/1', 1' has fucose"," CBank:21607",""," Comparative study of the asparagine-linked sugar chains of human erythropoietins purified from urine and the culture medium of recombinant Chinese hamster ovary cells","","9769","G02718AK"

A more human-readable version of the multi-line fields above:

| BS | MT | PM |
| --- | --- | --- |
| (CN) Chinese hamster, (OT) CHO cells | N-linked glycoprotein | EPO, erythropoietin, human |
| (CN) human, (OT) urine | recombinant glycoprotein | EPO, erythropoietin, human, recombinant |

This is an example of glycans that were purified from both an expression system and naturally occurring human urine. Also, the individual lines of a multi-line entry in one column do not appear to correspond to the lines in another column.

Regarding format, some observations will need to be split into separate rows due to multiple biological sources or glycosylation sites. I think the BS column should be split into new columns per your suggestion, but only after we address the multi-line issue. Let me know your thoughts.
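As a rough illustration of the BS split discussed here, the following sketch parses a single-line BS value into its tagged components. It assumes every component has the form "(TAG) free text" and that components are separated by commas, as in the example row above; the function name `parse_bs` is hypothetical, and other BS encodings in the file may need extra handling.

```python
import re

# Assumption: a component looks like "(CN) Chinese hamster" and runs until
# the next ", (TAG)" or the end of the string.
TAG_PATTERN = re.compile(r"\((\w+)\)\s*(.*?)(?=,\s*\(\w+\)|$)")


def parse_bs(value: str) -> dict:
    """Split a single-line BS value into {tag: free text}.

    If a tag occurs more than once, the last occurrence wins.
    """
    return {tag: text.strip() for tag, text in TAG_PATTERN.findall(value.strip())}


if __name__ == "__main__":
    print(parse_bs(" (CN) Chinese hamster, (OT) CHO cells"))
```

The extracted values are still free text, so they would have to be mapped to dictionary IDs in a later step.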

kmartinez834 commented 1 year ago

@ReneRanzinger Ignore rows with multi-lines in the following fields: AM, BS, MT, PA, PM

Some of these fields will be imported as keywords into GSA. Let's exclude the multi-line cases for now, but eventually we could replace the line breaks with a pipe or semicolon if they all apply to the same GSA record.

ReneRanzinger commented 1 year ago

I updated the CarbBank file: https://github.com/ReneRanzinger/org.glycomedb.export.glygen/blob/main/export/carbbank.csv.

Rows with multi-line entries in BS, MT, PA, or PM are filtered out, which leaves 35,869 of 49,897 records. The AM column is the experimental method. I linearized it and separated the different methods with "|". We allow multiple methods, so you just have to split the value on "|".
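For reference, the filtering and linearization described here could be sketched roughly as follows. This is not the code used to produce the file; it assumes the CSV has a header row whose column names are the CarbBank abbreviations, and the helper names `clean_rows` and `split_methods` are hypothetical.

```python
import csv
import io

MULTILINE_EXCLUDE = ("BS", "MT", "PA", "PM")  # rows with line breaks here are dropped
LINEARIZE = ("AM",)                           # line breaks here become "|"


def clean_rows(handle):
    """Yield rows without multi-line BS/MT/PA/PM values and with pipe-joined AM."""
    for row in csv.DictReader(handle):
        if any("\n" in (row.get(col) or "").strip() for col in MULTILINE_EXCLUDE):
            continue
        for col in LINEARIZE:
            parts = [p.strip() for p in (row.get(col) or "").splitlines()]
            row[col] = "|".join(p for p in parts if p)
        yield row


def split_methods(value):
    """Split a pipe-separated AM value back into individual methods."""
    return [m.strip() for m in value.split("|") if m.strip()]


if __name__ == "__main__":
    # Made-up sample data, only to show the behaviour.
    sample = 'ID,AM,BS\n"1","NMR\nMS","(CN) human"\n"2","NMR","(CN) human\n(CN) mouse"\n'
    for row in clean_rows(io.StringIO(sample)):
        print(row["ID"], split_methods(row["AM"]))
```

When reading the real file, open it with `newline=""` so that the `csv` module handles line breaks inside quoted fields correctly.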

I also split the BS field. I left the original column as is, but parsed the individual components into separate columns, which I appended after GlyTouCan:

kmartinez834 commented 1 year ago

Thanks @ReneRanzinger

If you have time, could you also replace the line breaks with pipes for the fields AN and DB?

ReneRanzinger commented 1 year ago

@kmartinez834 Ok, done.

kmartinez834 commented 1 year ago

@rykahsay here are the CarbBank fields and the corresponding GSA fields.

There will only be one tax_id per entry, so map using the most specific term. The order of terms from broadest to most specific is: domain, kingdom, class, family, common name, species.
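One way to pick the most specific term is sketched below. It assumes each row is a dict keyed by the parsed BS column names used in this issue (BS-GS, BS-CN, and so on) and that an empty string means the term is missing; `most_specific_term` is a hypothetical helper, and the lookup of the returned name in NCBI Taxonomy is left out.

```python
# Parsed BS columns ordered from most specific to broadest, i.e. the reverse
# of the order given above (domain, kingdom, class, family, common name, species).
SPECIFICITY = ["BS-GS", "BS-CN", "BS-F", "BS-C", "BS-K", "BS-domain"]


def most_specific_term(row):
    """Return (column, value) for the most specific non-empty term, or None."""
    for col in SPECIFICITY:
        value = (row.get(col) or "").strip()
        if value:
            return col, value
    return None


if __name__ == "__main__":
    # Made-up row, only to show the behaviour.
    print(most_specific_term({"BS-CN": "human", "BS-K": "Animalia", "BS-GS": ""}))
```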

| carbbank field | gsa field | processing notes |
| --- | --- | --- |
| CC | database_source | URL is https://www.genome.jp/entry/carbbank+%s |
| AM | experimental_method | some entries have multiple, separated by pipes |
| AN | keywords | some entries have multiple, separated by pipes |
| AU | publication | Author, use for mapping |
| CT | publication | Citation info, use for mapping |
| DB | xrefs | some entries have multiple, separated by pipes |
| MT | keywords | |
| PA | site | protein attachment site |
| PM | glycoprotein | Protein name, use for mapping |
| ST | evidence_type | Synthetic entry if the term "synthetic" is in this field |
| TI | publication | Paper title, use for mapping |
| GlycomeDB ID | xrefs | |
| GlyTouCan Acc | xrefs | |
| BS-CN | tax_name | common name |
| BS_OT | tissue | |
| BS-disease | disease | |
| BS-GS | tax_name | species |
| BS-GT | strain, serotype | field includes both strain and serotype |
| BS-C | tax_name | class |
| BS-K | tax_name | kingdom |
| BS-domain | tax_name | domain |
| BS-F | tax_name | family |
| BS-cell line | cell_type | mapping specific to carbbank file: carbbank_cell_lines.csv |
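A few of the processing notes above can be sketched as small helpers. The function names are hypothetical, the case-insensitive match on "synthetic" is an assumption, and the thread does not say whether the CC value needs any cleanup (for example removing a prefix) before it is substituted for %s, so the value is passed through with only whitespace trimmed.

```python
URL_TEMPLATE = "https://www.genome.jp/entry/carbbank+%s"


def database_source_url(cc_value):
    """Fill the URL template with the CC identifier (trimmed, otherwise unchanged)."""
    return URL_TEMPLATE % cc_value.strip()


def is_synthetic(st_value):
    """True if the ST field mentions "synthetic" (case-insensitive by assumption)."""
    return "synthetic" in st_value.lower()


def keywords(row):
    """Collect keywords from AN (pipe-separated) and MT, which both map to keywords."""
    values = (row.get("AN") or "").split("|") + [row.get("MT") or ""]
    return [v.strip() for v in values if v.strip()]


if __name__ == "__main__":
    # Made-up values, only to show the behaviour.
    print(database_source_url(" 21607"))
    print(is_synthetic("Synthetic compound"))
    print(keywords({"AN": "note one|note two", "MT": " N-linked glycoprotein"}))
```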

Note: If a field is not listed above, disregard it for now.