sec2pri / mapping_preprocessing


backslash added to metabolite names #23

Open tabbassidaloii opened 1 year ago

tabbassidaloii commented 1 year ago

@DeniseSl22 In the metabolite files, when there is a ' or " in the metabolite name, a \ has been added; see the example below:

Q425039 "bromcresol purple" "5',5\"-dibromo-o-cresolsulfophthalein"

DeniseSl22 commented 1 year ago

Yes, chemical names can be difficult..... But this is the actual name in the Wikidata entry: https://www.wikidata.org/wiki/Q425039

I believe the backslash is there, so that R can read the data, right? Or is that not working?

DeniseSl22 commented 1 year ago

I've now changed the entry, so the name is 5',5'' (using the single quote twice instead of a double quote). That might resolve things for this entry; I'll run the Wikidata GitHub action again. Are any others causing issues, @tabbassidaloii?

tabbassidaloii commented 1 year ago

It messes up in R when there is a /. It reads with no issue when I remove the unnecessary / manually. Also, when there is a / in the name, an extra / is added.

DeniseSl22 commented 1 year ago

Forward or backward slash, which one is causing the issue?

tabbassidaloii commented 1 year ago

backslashes are added when there is ', ", or a backslash (/).

tabbassidaloii commented 1 year ago

I removed them manually in the file, to keep the app running correctly for now.

DeniseSl22 commented 1 year ago

How many are there? Could you give me some more examples? Then I can try to construct a regex to solve this. Using TSV should already solve the issues around a comma within a name.
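To get a count and a list of examples, the affected lines could be pulled out of a downloaded file with grep. This is a minimal sketch, assuming the files are TSV and that the problem is one or more backslashes directly in front of a ' or "; the function names are hypothetical and not part of the repository.

```shell
# Hypothetical helpers, not part of the repository.
# The regex \\+["'] matches one or more backslashes followed by a quote.

# Print matching lines with their line numbers.
find_escaped_quotes() {
  grep -nE "\\\\+[\"']" "$1"
}

# Print only the number of matching lines.
count_escaped_quotes() {
  grep -cE "\\\\+[\"']" "$1"
}

# Usage (the file name is a placeholder):
#   count_escaped_quotes metabolites.tsv
#   find_escaped_quotes metabolites.tsv | head
```

Backslashes that are not followed by a quote are ignored here, so names that legitimately contain other slashes are not reported.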

tabbassidaloii commented 1 year ago

It is around 20 to 30, I think. Here are some more examples:

5',5\\"-dibromo-o-cresolsulfophthalein
Germa-Medica \\"Mg\\"
2,5-O,O-BIS-(3',3\\"-AMIDINOPHENYL)-1,4:3,6-DIANHYDRO-D-SORBITOL
diinosine-5',5\\"-pentaphosphate
3-O-beta-D-glucopyranosylpresenegenin 28-O-{[beta-D-apiofuranosyl(1-3)][beta-D-galactopyranosyl(1-4)-beta-D-xylopyranosyl(1-4)]-alpha-L-rhamnopyranosyl(1-3)}{4-O-(E)-4\\"-methoxycinnamoyl}-beta-D-fucopyranoside
3-O-beta-D-glucopyranosylpresenegenin 28-O-{[beta-D-apiofuranosyl(1-3)][beta-D-galactopyranosyl(1-4)-beta-D-xylopyranosyl(1-4)]-alpha-L-rhamnopyranosyl(1-3)}{4-O-(Z)-4\\"-methoxycinnamoyl}-beta-D-fucopyranosid
alpha-GalCer-6\\"-(pyridin-4-yl)carbamate
alpha-GalCer-6\\"-(4-pyridyl)carbamate
2'-[2\\"-(5'''-phosphoribosyl)-5\\"-phosphoribosyl]adenosine 5'-monophosphate
2'-[2\\"-(1'''-ribosyl)-1\\"-ribosyl]adenosine 5',5\\",5'''-tris(phosphate)
alpha-GalCer-6\\"(1-naphthyl)carbamate
2'-(5\\"-phosphoribosyl)adenosine 5'-monophosphate
2'-(1\\"-ribosyl)adenosine 5',5\\"-bis(phosphate)
alpha-GalCer-6\\"-(4-chlorophenyl)carbamate
N,N',N\\"-trimethyl-1,4,7-triazacyclononane
12,24-Dihydro-5H-naphtho[2,3-h]naphth[2\\",3\\":6,7]anthra[2,1,9-mna]acridine-5,10,13,18,25-pentone

DeniseSl22 commented 1 year ago

When I filter the names with a regex in the query itself, "FILTER(REGEX(?name, "['\"\/]", "i")).", I get 96963 results in 29005 ms.

We could either filter these out before we obtain the data (changing the SPARQL query), or find a way so that R can read these (e.g. like this).... I don't think we should replace these characters with another one (besides changing the double quote to two single quotes).

What's your preference @tabbassidaloii ?

tabbassidaloii commented 1 year ago

@DeniseSl22 I don't understand why you filter the results. The backslashes are added to the outputs of the queries (and I am not sure how to avoid that), so I would try to solve it before reading the file in R. For example, the query output should contain Germa-Medica "Mg" instead of Germa-Medica \\"Mg\\".

There are options in R to remove them, but I have concerns about causing unnecessary changes in the other values (e.g. there are metabolites with a backslash (/) in their names, and we should keep those backslashes). So I would fix it in the bash script.
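A fix in bash along these lines could be a single sed substitution that removes only the backslashes sitting directly in front of a quote and leaves every other character alone. This is a sketch under that assumption; `strip_quote_escapes` is a hypothetical name, and where the extra backslashes actually come from is still open in this thread.

```shell
# Hypothetical helper, not part of the repository.
# Reads stdin, writes stdout. The sed script is: s/\\+(["'])/\1/g
# i.e. drop any run of backslashes that directly precedes ' or ".
strip_quote_escapes() {
  sed -E "s/\\\\+([\"'])/\\1/g"
}

# Usage:
#   printf '%s\n' 'Germa-Medica \\"Mg\\"' | strip_quote_escapes
#   # prints: Germa-Medica "Mg"
#
#   strip_quote_escapes < input.tsv > cleaned.tsv   # file names are placeholders
```

Because the pattern requires a quote right after the backslashes, a slash or backslash elsewhere in a name passes through unchanged.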

DeniseSl22 commented 1 year ago

When I look at the raw GitHub data, I don't see this issue. I think the file is not being read correctly in the code of the Shiny app. Something like this line might need to be adapted:

dataset <- data.table::fread("processed_mapping_files/HGNC_secID2priID.tsv")

We could maybe have a look at this together, @tabbassidaloii (I have some time on Wednesday morning), just to get a clearer idea of where the issue is coming from.

tabbassidaloii commented 1 year ago

This is weird because I see it in the files I downloaded from GitHub, before opening them in R. Yeah, let's meet either at 9 or at 11 am.

DeniseSl22 commented 1 year ago

As discussed:

tabbassidaloii commented 3 months ago

@DeniseSl22 if #9 is solved we can work on #106 considering below

As discussed:

  • [ ] Do not push Wikidata data to GitHub, since this adds backslashes, which breaks reading the data in R.
  • [ ] Create a logfile with the line count or (unique) word count for each file (bash), and push the logfile to GitHub.
  • [ ] Compare the logfile of the previous release to the new one, and print the results to the terminal (bash).