Evaluate population of GSA with all CarbBank entries

ReneRanzinger commented 1 year ago

Evaluate the possibility of populating GSA with CarbBank data. A CSV with CarbBank data can be found here. The column abbreviations are explained here.

The first step would be to extract the biological data from the BS column. Several kinds of information are encoded in that column:

CN - species common name
GS - species scientific name
OT - organ type
LS - life stage
disease - Disease
...

The values have no dictionary IDs; all of them are free text. We will need to extract them and map them to the corresponding dictionaries (NCBI Taxonomy, Disease Ontology, UBERON, etc.).
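A minimal sketch of the extraction step, assuming every component in BS is introduced by a parenthesised tag as in "(CN) human, (OT) urine". The function name `parse_bs` is hypothetical, and values that themselves contain parentheses would need extra handling that is not shown here.

```python
import re

# Matches a parenthesised tag such as "(CN)" or "(OT)". The pattern is inferred
# from the examples in this thread, not from a CarbBank format specification.
TAG_RE = re.compile(r"\(([^()]+)\)\s*")


def parse_bs(value):
    """Split one single-line BS entry into a {tag: text} dictionary."""
    # re.split with one capture group returns
    # [prefix, tag1, text1, tag2, text2, ...]
    parts = TAG_RE.split(value.strip())
    result = {}
    for tag, text in zip(parts[1::2], parts[2::2]):
        # Drop the comma that separates one component from the next.
        # If a tag repeats, the last value wins.
        result[tag.strip()] = text.strip().rstrip(",").strip()
    return result


print(parse_bs(" (CN) Chinese hamster, (OT) CHO cells"))
```

The free-text values returned here would still have to be mapped to dictionary IDs in a separate step.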

Have a look and let me know how to proceed. If you want me to change the format (add or remove the columns) or split the columns up, let me know.

kmartinez834 commented 1 year ago

@ReneRanzinger There are rows that contain multiple values, separated by line breaks. The affected columns (with data that can be ingested by GSA) are AM, BS, MT, PA, PM.

For example, the following entry starting at line 98 has two different biological sources listed:

"16519","","",""," Takeuchi M; Takasaki S; Miyazaki H; Kato T; Hoshi S; Kochibe N; Kobata A",""," (CN) Chinese hamster, (OT) CHO cells (CN) human, (OT) urine"," J Biol Chem (1988) 263: 3657-3663"," 04-01-1992",""," N-linked glycoprotein recombinant glycoprotein",""," 1-2 Neup5Ac per molecule, urinary HuEPO has .alpha.2.fwdawr.3 and .alpha.2.fwdawr.6 linkages, rHuEPO has .alpha.2.fwdawr.3 linkages",""," EPO, erythropoietin, human EPO, erythropoietin, human, recombinant"," Kleen A"," AN1/1', 1' has fucose"," CBank:21607",""," Comparative study of the asparagine-linked sugar chains of human erythropoietins purified from urine and the culture medium of recombinant Chinese hamster ovary cells","","9769","G02718AK"

More human-readable version of the multi-line fields above:

BS	MT	PM
(CN) Chinese hamster, (OT) CHO cells	N-linked glycoprotein	EPO, erythropoietin, human
(CN) human, (OT) urine	recombinant glycoprotein	EPO, erythropoietin, human, recombinant

This is an example of glycans that were purified from both an expression system and naturally occurring human urine. Also, the lines of a multi-line entry in one column do not appear to correspond to the lines in another column.

Regarding format, some observations will need to be split into separate rows due to multiple biological sources or glycosylation sites. I think the BS column should be split into new columns per your suggestion, but only after we address the multi-line issue. Let me know your thoughts.

kmartinez834 commented 1 year ago

@ReneRanzinger Ignore rows with multi-line values in the following fields: AM, BS, MT, PA, PM

Some of these fields will be imported as keywords into GSA. Let's exclude the multi-line cases for now, but eventually we could replace the line breaks with a pipe or semicolon if they all apply to the same GSA record.
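As a rough illustration of both options, here is a sketch that sets aside rows with line breaks in the listed fields and, separately, collapses a multi-line value into a pipe-separated one. It assumes the CSV has a header row using the column abbreviations; `split_rows` and `join_lines` are hypothetical helper names.

```python
import csv
import io

# Fields named above as affected by multi-line values.
MULTILINE_FIELDS = ("AM", "BS", "MT", "PA", "PM")


def split_rows(csv_text):
    """Return (kept, skipped); skipped rows have a line break in a checked field."""
    kept, skipped = [], []
    # newline="" keeps line breaks inside quoted fields untouched.
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    for row in reader:
        values = ((row.get(name) or "").strip() for name in MULTILINE_FIELDS)
        if any("\n" in value for value in values):
            skipped.append(row)
        else:
            kept.append(row)
    return kept, skipped


def join_lines(value, sep="|"):
    """Collapse a multi-line field into one line, dropping empty lines."""
    return sep.join(line.strip() for line in value.splitlines() if line.strip())
```

Joining with a separator only makes sense when all lines apply to the same GSA record; rows such as the EPO example above would still need to be split into separate records.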

ReneRanzinger commented 1 year ago

I updated the CarbBank file: https://github.com/ReneRanzinger/org.glycomedb.export.glygen/blob/main/export/carbbank.csv

Rows with multi-line entries in BS, MT, PA, or PM are filtered out, which leaves 35,869 of 49,897 records. The AM column is the experimental method. I linearized it and separated the different methods with "|". We allow multiple methods, so you just have to split the value on "|".

I also split the BS field. I left the original column as is, but parsed the individual components into separate columns, which I appended after GlyTouCan:

BS-CN
BS-OT
BS-disease
BS-LS
BS-GS
BS-GT
BS-C
BS-*
BS-cell line
BS-K
BS-domain
BS-BS
BS-F
BS-O

kmartinez834 commented 1 year ago

Thanks @ReneRanzinger

If you have time, could you also replace the line breaks with pipes for the fields AN and DB?

ReneRanzinger commented 1 year ago

@kmartinez834 Ok, done.

kmartinez834 commented 1 year ago

@rykahsay here are the CarbBank fields and the corresponding GSA fields.

There will only be one tax_id per entry, so map using the most specific term. Order of terms from broadest to most specific: domain, kingdom, class, family, common name, species.
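One way to pick the term is to walk the columns from most specific to broadest and take the first non-empty value. This is only a sketch: `most_specific_tax_name` is a hypothetical name, the row is assumed to be a dictionary keyed by the BS-* column names, and the lookup of the chosen name in NCBI Taxonomy is not shown.

```python
# BS-* columns ordered from broadest to most specific, following the order
# given above: domain, kingdom, class, family, common name, species.
TAX_COLUMNS = ["BS-domain", "BS-K", "BS-C", "BS-F", "BS-CN", "BS-GS"]


def most_specific_tax_name(row):
    """Return (column, value) for the most specific non-empty term, or None."""
    for column in reversed(TAX_COLUMNS):
        value = (row.get(column) or "").strip()
        if value:
            return column, value
    return None


row = {"BS-domain": "Eukaryota", "BS-CN": "human", "BS-GS": ""}
print(most_specific_tax_name(row))
```

Returning the column name along with the value lets the caller tell a common name from a scientific name when doing the dictionary lookup.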

carbbank field	gsa field	processing notes
CC	database_source	URL is https://www.genome.jp/entry/carbbank+%s
AM	experimental_method	some entries have multiple, separated by pipes
AN	keywords	some entries have multiple, separated by pipes
AU	publication	Author, use for mapping
CT	publication	Citation info, use for mapping
DB	xrefs	some entries have multiple, separated by pipes
MT	keywords
PA	site	protein attachment site
PM	glycoprotein	Protein name, use for mapping
ST	evidence_type	Synthetic entry if the term "synthetic" is in this field
TI	publication	Paper title, use for mapping
GlycomeDB ID	xrefs
GlyTouCan Acc	xrefs
BS-CN	tax_name	common name
BS-OT	tissue
BS-disease	disease
BS-GS	tax_name	species
BS-GT	strain, serotype	field includes both strain and serotype
BS-C	tax_name	class
BS-K	tax_name	kingdom
BS-domain	tax_name	domain
BS-F	tax_name	family
BS-cell line	cell_type	mapping specific to carbbank file: carbbank_cell_lines.csv

Note: If a field is not listed above, disregard it for now.
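To make a few of the processing notes concrete, here is a sketch that turns one CSV row into a flat dictionary for the simpler fields. The output shape, the function name `to_gsa_fields`, the case-insensitive match on "synthetic", and the use of None for non-synthetic entries are all assumptions for illustration; the CC value is inserted into the URL as-is, which may or may not be what the final import needs.

```python
# URL template from the mapping table above.
CARBBANK_URL = "https://www.genome.jp/entry/carbbank+%s"


def to_gsa_fields(row):
    """Map the simpler CarbBank columns of one row to GSA field names."""

    def clean(name):
        return (row.get(name) or "").strip()

    def split_pipes(name):
        # AM, AN and DB may hold several values separated by pipes.
        return [part.strip() for part in clean(name).split("|") if part.strip()]

    # AN and MT both feed the keywords field.
    keywords = split_pipes("AN")
    if clean("MT"):
        keywords.append(clean("MT"))

    return {
        "database_source": CARBBANK_URL % clean("CC"),
        "experimental_method": split_pipes("AM"),
        "keywords": keywords,
        "xrefs": split_pipes("DB"),
        # Assumption: match "synthetic" regardless of case; None otherwise.
        "evidence_type": "synthetic" if "synthetic" in clean("ST").lower() else None,
    }
```

Publication, glycoprotein, site, and taxonomy fields are left out because they require mapping against other resources.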

glygener / glygen-backend-gsa

Evaluate population of GSA with all CarbBank entries #58