materialsintelligence / matscholar-web

Code for the Materials Scholar website
http://matscholar.com
MIT License

Entities need filtering - nonsense returned on searches #107

Closed ardunn closed 4 years ago

ardunn commented 5 years ago

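On the search for application: light sensor, the last 3 results are [image: image] https://user-images.githubusercontent.com/19936203/65732808-5d769900-e080-11e9-9853-859e9e93eee8.png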

@jdagdelen @AmalieT

AmalieT commented 5 years ago

I'm going to band-aid this by only displaying materials that parse through the materials parser. Ultimately we should work on improving the accuracy of the NER model, imo.


AmalieT commented 5 years ago

Similar materials that only appear in a single DOI are now pushed through the materials parser. It seems to have removed most of these bad results.
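
A minimal sketch of the kind of filter described here, assuming "parses through the materials parser" means roughly "can be read as a chemical formula". The actual matscholar parser is not shown in this thread; `ELEMENTS`, `parses_as_formula`, and `filter_entities` are hypothetical names, and the element set is a small illustrative subset.

```python
import re

# Illustrative subset of element symbols (assumption); a real parser
# would cover the full periodic table.
ELEMENTS = {"H", "Li", "C", "N", "O", "P", "Mn", "Fe", "Co", "Ni",
            "Ga", "As", "In", "Te", "Pb", "Bi"}

# One token: an element-like symbol (capital letter plus optional
# lowercase letter) followed by an optional numeric amount.
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*\.?\d*)")


def parses_as_formula(text):
    """Return True if the whole string reads as a sequence of known elements."""
    pos = 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None or match.group(1) not in ELEMENTS:
            return False
        pos = match.end()
    return pos > 0  # reject the empty string


def filter_entities(entities):
    """Keep only the entities that parse as a formula."""
    return [e for e in entities if parses_as_formula(e)]


candidates = ["PbTe", "LiFePO4", "GmbH & co", "come", "THE PHYSICAL SOCIETY"]
kept = filter_entities(candidates)
```

A strict formula check like this also rejects legitimate non-formula names such as "rGO" or "graphene", so on its own it trades bad results for some missing good ones.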

ardunn commented 5 years ago

@AmalieT we still seem to have some leakage of the pesky "GmbH & co". E.g., on searching "application: cathode" it says there are 126 relevant papers with "GmbH & co" as a material entity. There are also still some other entities which don't make sense to me, like "rGO" and "carbonaceous".

ardunn commented 5 years ago

Similarly, when searching for materials with phase: cubic heusler,

image

The word "come" is registered as a material. The abstract is relevant (it has to do with Heuslers etc., which is good), but the entity that was extracted as a material is not an actual material.

LeighWeston86 commented 5 years ago

"rGO" is "reduced graphene oxide" so that is correct, but "come" is obviously an error (from an NER perspective). There are actually only about 20 materials that are not normalized to a chemical formula, things like graphene, steel etc., so other than these everything that cannot be parsed to a chemical formula is either an NER error (e.g., "come") or a parsing error (e.g., "rGA" - this one probably happened because the r is lowercase, and so it wasn't flagged as an acronym during parsing). Maybe anything that is an NER/parsing error could just be removed from the mongo collection?? That would fix most of these errors.

ardunn commented 5 years ago

Similarly for "material: PbTe, application: thermoelectric":

The following entities are returned:

LeighWeston86 commented 5 years ago

Things like "bi-doped" or "nonequilbrium" might have been labelled as descriptors in the training set. "bi-doped" was definitely labelled as a descriptor originally; I can't remember if we went back and changed it. "Asin" is not a material but "AsIn" is, so maybe the character part of the LSTM got confused (unless Asin was accidentally changed to AsIn during preprocessing?).
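
One way to test the preprocessing hypothesis would be to compare each entity's surface form in the abstract with its normalized form and flag pairs that differ only in letter case. This is a sketch only: `case_only_changes` is a hypothetical helper, and the pairs below are made up for illustration, not pulled from the database.

```python
def case_only_changes(pairs):
    """Return (surface, normalized) pairs that differ only in letter case."""
    return [
        (surface, normalized)
        for surface, normalized in pairs
        if surface != normalized and surface.lower() == normalized.lower()
    ]


# Illustrative (surface form, normalized form) pairs.
pairs = [("Asin", "AsIn"), ("PbTe", "PbTe"), ("AsIn", "AsIn")]
flagged = case_only_changes(pairs)
```

If pairs like ("Asin", "AsIn") turn up, the case change happened somewhere between the text and normalization. If the surface form is already "AsIn" in the abstract, the extraction itself is fine.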

LeighWeston86 commented 5 years ago

https://github.com/materialsintelligence/matscholar-core/blob/master/matscholar_core/nlp/normalize.py

On Mon, Sep 30, 2019 at 12:40 PM AmalieT notifications@github.com wrote:

@LeighWeston86 Did you ever happen to make a list of the non-formula materials?


ardunn commented 5 years ago

Some more examples of this: "THE PHYSICAL SOCIETY" came up as an entity

ardunn commented 4 years ago

After testing these queries again, this is no longer an issue.