Open tabbassidaloii opened 1 year ago
Yes, chemical names can be difficult..... But this is the actual name in the Wikidata entry: https://www.wikidata.org/wiki/Q425039
I believe the backslash is there, so that R can read the data, right? Or is that not working?
I've now changed the entry, so the name is 5',5'' (so using the single quote twice, instead of using a double quote). That might resolve things for this entry; I'll run the Wikidata GitHub action again. Are any others causing issues, @tabbassidaloii?
It messes up in R when there is a `/`. It reads with no issue when I remove the unnecessary `/` manually.
Also, when there is a `/` in the name, an extra `/` is added.
Forward or backward slash, which one is causing the issue?
Backslashes are added when there is `'`, `"`, or a backslash (`/`).
I removed them manually in the file, to keep the app running correctly for now.
How many are there, could you give me some more examples? Then I can (try) to construct a regex to solve this.... using TSV should already solve issues around using a comma within a name...
It is around 20 to 30, I think. Here are some more examples:
5',5\\"-dibromo-o-cresolsulfophthalein
Germa-Medica \\"Mg\\"
2,5-O,O-BIS-(3',3\\"-AMIDINOPHENYL)-1,4:3,6-DIANHYDRO-D-SORBITOL
diinosine-5',5\\"-pentaphosphate
3-O-beta-D-glucopyranosylpresenegenin 28-O-{[beta-D-apiofuranosyl(1-3)][beta-D-galactopyranosyl(1-4)-beta-D-xylopyranosyl(1-4)]-alpha-L-rhamnopyranosyl(1-3)}{4-O-(E)-4\\"-methoxycinnamoyl}-beta-D-fucopyranoside
3-O-beta-D-glucopyranosylpresenegenin 28-O-{[beta-D-apiofuranosyl(1-3)][beta-D-galactopyranosyl(1-4)-beta-D-xylopyranosyl(1-4)]-alpha-L-rhamnopyranosyl(1-3)}{4-O-(Z)-4\\"-methoxycinnamoyl}-beta-D-fucopyranosid
alpha-GalCer-6\\"-(pyridin-4-yl)carbamate
alpha-GalCer-6\\"-(4-pyridyl)carbamate
2'-[2\\"-(5'''-phosphoribosyl)-5\\"-phosphoribosyl]adenosine 5'-monophosphate
2'-[2\\"-(1'''-ribosyl)-1\\"-ribosyl]adenosine 5',5\\",5'''-tris(phosphate)
alpha-GalCer-6\\"(1-naphthyl)carbamate
2'-(5\\"-phosphoribosyl)adenosine 5'-monophosphate
2'-(1\\"-ribosyl)adenosine 5',5\\"-bis(phosphate)
alpha-GalCer-6\\"-(4-chlorophenyl)carbamate
N,N',N\\"-trimethyl-1,4,7-triazacyclononane
12,24-Dihydro-5H-naphtho[2,3-h]naphth[2\\",3\\":6,7]anthra[2,1,9-mna]acridine-5,10,13,18,25-pentone
When I filter the results with a regex on the names in the query itself, `FILTER(REGEX(?name, "['\"\/]", "i")).`, I get 96963 results in 29005 ms.
We could either filter these out before we obtain the data (changing the SPARQL query), or find a way so that R can read these (e.g. like this).... I don't think we should replace these characters with another one (besides changing the double quote to two single quotes).
What's your preference @tabbassidaloii ?
@DeniseSl22
I don't understand why you filter the results.
The backslashes are added to the outputs of the queries (and I am not sure how to avoid it), so I would try to solve it before reading the file in R.
For example, we would want `Germa-Medica "Mg"` in the output of the query instead of `Germa-Medica \\"Mg\\"`.
There are options in R to remove them, but I have concerns about causing unnecessary changes in the other values (e.g. there are metabolites with a backslash (`/`) in their names, and we should keep those backslashes). So I would fix it in the bash script.
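A minimal sketch of the "fix it before reading the file in R" idea, written in Python rather than bash purely for illustration. It assumes the only unwanted characters are backslashes sitting directly in front of a `'` or `"`, as in the examples earlier in this thread; the function name `unescape_quotes` and the slash-containing test name are hypothetical. Slashes anywhere else in a name are left untouched.

```python
import re

# One or more backslashes directly in front of a single or double quote.
# Assumption: this is the only pattern that needs cleaning.
_ESCAPED_QUOTE = re.compile(r"\\+(['\"])")


def unescape_quotes(name: str) -> str:
    """Drop backslashes that directly precede ' or "; leave all other characters alone."""
    return _ESCAPED_QUOTE.sub(r"\1", name)


examples = [
    r'Germa-Medica \\"Mg\\"',                   # taken from the thread
    r"""diinosine-5',5\\"-pentaphosphate""",    # taken from the thread
    "compound A/B",                             # hypothetical name with a slash
]

for raw in examples:
    print(unescape_quotes(raw))
```

Because the pattern is anchored to the quote that follows, a slash that is part of a metabolite name is not affected. The same substitution could be applied line by line to a TSV file before it is handed to R.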
When I look at the raw GitHub data, I don't see this issue.... I think the reading of the file in the code of the Shiny App is not going correctly; something like this line might need to be adapted:
dataset <- data.table::fread("processed_mapping_files/HGNC_secID2priID.tsv")
We could maybe have a look at this together (I have some time on Wednesday morning), @tabbassidaloii? Just to get a clearer idea of where the issue is coming from...
This is weird because I see it in the files I downloaded from GitHub, before opening them in R. Yeah, let's meet either at 9 or at 11 am.
As discussed:
@DeniseSl22 if #9 is solved, we can work on #106, considering the points below.
As discussed:
- [ ] Do not push Wikidata data to GitHub, since this adds backslashes, which breaks reading the data in R.
- [ ] Create a logfile, counting the lines or (unique) wordcount for each file (bash), and push the logfile to GitHub.
- [ ] Compare the logfile of the previous release to the new one, and print the results to the terminal (bash).
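The logfile steps in the list above could be sketched roughly as follows. The list asks for bash; this version is in Python only to illustrate the logic, and all function names, the `*.tsv` pattern, and the tab-separated logfile layout are assumptions rather than anything agreed in this thread.

```python
from __future__ import annotations

from pathlib import Path


def count_lines(folder: str, pattern: str = "*.tsv") -> dict[str, int]:
    """Map each matching file name in `folder` to its number of lines."""
    counts = {}
    for path in sorted(Path(folder).glob(pattern)):
        with path.open(encoding="utf-8") as handle:
            counts[path.name] = sum(1 for _ in handle)
    return counts


def write_log(counts: dict[str, int], log_path: str) -> None:
    """Write one 'file name<TAB>line count' row per file."""
    rows = [f"{name}\t{n}" for name, n in sorted(counts.items())]
    Path(log_path).write_text("\n".join(rows) + "\n", encoding="utf-8")


def read_log(log_path: str) -> dict[str, int]:
    """Read a logfile produced by write_log back into a dict."""
    counts = {}
    for row in Path(log_path).read_text(encoding="utf-8").splitlines():
        name, n = row.split("\t")
        counts[name] = int(n)
    return counts


def compare_logs(previous: dict[str, int], current: dict[str, int]) -> list[str]:
    """Return one readable line per file whose count changed, appeared, or disappeared."""
    report = []
    for name in sorted(set(previous) | set(current)):
        old, new = previous.get(name), current.get(name)
        if old != new:
            report.append(f"{name}: {old} -> {new}")
    return report
```

A release script could call `count_lines` on the mapping-file folder, store the result with `write_log`, and print the output of `compare_logs(read_log(old_log), read_log(new_log))` to the terminal. A file that is missing from one of the two logs shows up with `None` on that side.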
@DeniseSl22 In the metabolite files, when there is `'` or `"` in the metabolite name, a `\` has been added; see the example below: