Closed ardunn closed 4 years ago
I'm going to band-aid this by only displaying materials that parse through the materials parser. Ultimately we should work on improving the accuracy of the NER model, imo.
On Thu, Sep 26, 2019, 17:09 Alex Dunn notifications@github.com wrote:
On the search for application: light sensor, the last 3 results are [image: image] https://user-images.githubusercontent.com/19936203/65732808-5d769900-e080-11e9-9853-859e9e93eee8.png
@jdagdelen https://github.com/jdagdelen @AmalieT https://github.com/AmalieT
Similar materials that only appear in a single DOI are now pushed through the materials parser. It seems to have removed most of these bad results.
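For reference, a minimal sketch of the rule described above, not the actual matscholar-web code. The function name, the `(material, doi)` input shape, and the `parses_as_material` callable are all assumptions made for illustration; the real materials parser would be passed in where the stand-in is used.

```python
from collections import defaultdict


def filter_similar_materials(mentions, parses_as_material):
    """Keep materials seen in several DOIs; check single-DOI ones with the parser.

    mentions: iterable of (material, doi) pairs (hypothetical input shape).
    parses_as_material: callable returning True if the parser accepts the name.
    """
    dois_by_material = defaultdict(set)
    for material, doi in mentions:
        dois_by_material[material].add(doi)

    kept = []
    for material, dois in dois_by_material.items():
        # Only single-DOI entities are pushed through the parser.
        if len(dois) > 1 or parses_as_material(material):
            kept.append(material)
    return kept


# Placeholder DOIs and a stand-in parser, purely for illustration.
example_mentions = [
    ("PbTe", "doi-1"),
    ("PbTe", "doi-2"),
    ("come", "doi-3"),
    ("SnSe", "doi-4"),
]
stand_in_parser = lambda name: name in {"PbTe", "SnSe"}
example_result = filter_similar_materials(example_mentions, stand_in_parser)
```

Entities that show up in more than one DOI skip the check here, which is why things like "GmbH & co" can still leak through.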
@AmalieT we still seem to have some leakage of the pesky "GmbH & co". E.g., searching "application: cathode" reports 126 relevant papers with "GmbH & co" as a material entity. There are also still some other entities which don't make sense to me, like "rGO" and "carbonaceous".
Similarly, when searching for materials with phase: cubic heusler, the word "come" is registered as a material. The abstract is relevant (it has to do with Heuslers etc., which is good), but the entity that was extracted as a material is not an actual material.
"rGO" is "reduced graphene oxide" so that is correct, but "come" is obviously an error (from an NER perspective). There are actually only about 20 materials that are not normalized to a chemical formula, things like graphene, steel etc. Other than these, everything that cannot be parsed to a chemical formula is either an NER error (e.g., "come") or a parsing error (e.g., "rGO"; this one probably happened because the r is lowercase, and so it wasn't flagged as an acronym during parsing). Maybe anything that is an NER/parsing error could just be removed from the mongo collection? That would fix most of these errors.
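A rough sketch of that cleanup rule, under stated assumptions: `looks_like_formula` is a crude regex stand-in for the real materials parser, and `NON_FORMULA_MATERIALS` is a hypothetical allowlist seeded only with names mentioned in this thread (the real list of about 20 is not reproduced here). Anything that neither parses as a formula nor appears in the allowlist is treated as an NER/parsing error.

```python
import re

# Element symbols, used by the stand-in formula check.
ELEMENTS = set(
    "H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar K Ca Sc Ti V Cr Mn Fe Co Ni "
    "Cu Zn Ga Ge As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I "
    "Xe Cs Ba La Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu Hf Ta W Re Os Ir Pt "
    "Au Hg Tl Pb Bi Po At Rn Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No Lr "
    "Rf Db Sg Bh Hs Mt Ds Rg Cn Nh Fl Mc Lv Ts Og".split()
)

# Hypothetical allowlist of materials that have no chemical formula (lowercased).
NON_FORMULA_MATERIALS = {"graphene", "steel", "rgo"}

# One or more "Symbol + optional amount" groups, e.g. Pb Te, Li Fe P O4.
FORMULA_RE = re.compile(r"(?:[A-Z][a-z]?\d*(?:\.\d+)?)+")


def looks_like_formula(name):
    """Crude stand-in for the materials parser (assumption, not the real one)."""
    if not FORMULA_RE.fullmatch(name):
        return False
    symbols = re.findall(r"[A-Z][a-z]?", name)
    return all(symbol in ELEMENTS for symbol in symbols)


def keep_entity(name):
    """Keep allowlisted names and anything that passes the formula check."""
    return name.lower() in NON_FORMULA_MATERIALS or looks_like_formula(name)


def filter_entities(names):
    return [name for name in names if keep_entity(name)]


candidates = ["PbTe", "come", "GmbH & co", "rGO", "graphene", "Asin", "AsIn"]
cleaned = filter_entities(candidates)
```

With this stand-in, "come", "GmbH & co" and "Asin" are dropped, while "rGO" only survives because it is on the allowlist. A check this simple would also accept any capitalized string that happens to split into element symbols, so the real parser should be used for the actual cleanup.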
Similarly for "material: PbTe, application: thermoelectric":
The following entities are returned:
Things like "bi-doped" or "nonequilbrium" might have been labelled as a descriptor in the training set. "bi-doped" was definitely labelled as a descriptor originally; I can't remember if we went back and changed it. "Asin" is not a material but "AsIn" is, so maybe the character part of the LSTM got confused (unless Asin was accidentally changed to AsIn during preprocessing?).
On Mon, Sep 30, 2019 at 12:40 PM AmalieT notifications@github.com wrote:
@LeighWeston86 https://github.com/LeighWeston86 Did you ever happen to make a list of the non-formula materials?
Another example of this: "THE PHYSICAL SOCIETY" came up as an entity.
After testing these queries again, this is no longer an issue.