journal suggestion returns high confidence for gibberish?

ardunn commented 5 years ago

Maybe I am misunderstanding what the confidence number means, but...

If I try a real abstract:

abstract = "Nd-rich phase plays a critical role in wetting grain boundary and facilitating texture formation for hot deformed (HD) Nd-Fe-B magnets. In this study, a non-uniform distribution of Nd-rich phase with dimension up to a few micrometers was observed in nanocrystalline HD magnets. The aggregation of the Nd-rich phase is confirmed to result from the low density precursor prepared by spark plasma sintering (SPS). The large local demagnetizing fields induced by Nd-rich phase aggregation led to the open recoil loops and reduced coercivity. Upon reducing recoil loop openness by eliminating Nd-rich phase aggregation, the coercivity of the HD magnet was significantly improved from 226 kA/m to 995 kA/m, and a high maximum energy product of 293 kJ/m3 was obtained. The dependences of microstructure and coercivity on the recoil loop characteristics suggest an essential approach for improving the magnetic properties of nanocrystalline HD Nd-Fe-B magnets."

res = rester.get_journal_suggestion(abstract)
print(res)

The top results have ~57% confidence.

If I try some nonsense,

nonsense = "This is some nonsense abstract. I shouldn't get anything real back, or maybe it should throw an error?"

res = rester.get_journal_suggestion(nonsense)
print(res)

The top results have 66% confidence.

If I try some random characters,

nonsense2 = "kaj dqwbr239hr23pi wfAFASFEF@#TFVSFDWR@##*#$(%)^($#*@#($%"
res = rester.get_journal_suggestion(nonsense2)
print(res)

The top results have ~60% confidence.

How can nonsense text produce journal suggestions with greater confidence than actual abstracts? What does the confidence number actually mean?

ardunn commented 5 years ago

@kevinyang8 @jdagdelen

jdagdelen commented 5 years ago

Out of curiosity, what journals are suggested for the nonsense?

jdagdelen commented 5 years ago

Confidence is the document vector cosine similarity. We can probably omit it from the results table.
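For anyone reading along, here is a minimal sketch of what a cosine similarity score measures. The vectors are made-up toy values and `cosine_similarity` is a hypothetical helper, not the actual matscholar code; the point is only that the number is an angle-based similarity, not a probability.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (assumed values, not real document embeddings).
doc = [0.2, 0.1, 0.4]
same_direction = [0.4, 0.2, 0.8]   # doc scaled by 2
orthogonal = [0.1, -0.2, 0.0]      # dot product with doc is zero

sim_same = cosine_similarity(doc, same_direction)
sim_orth = cosine_similarity(doc, orthogonal)
print(sim_same, sim_orth)
```

Because only the direction of the vectors matters, any input that happens to point roughly toward a journal's vector gets a high score, whatever its content.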

ardunn commented 5 years ago

The nonsense (yet grammatical) sentences get:

[['Biometric Technology Today', 0.6736129522323608], ['Journal of Materials Science Letters', 0.6650767922401428], ['She Ji: The Journal of Design, Economics, and Innovation', 0.6291261911392212], ['Design Studies', 0.6281062960624695], ['Applied Mathematics and Computation', 0.62021803855896], ['Annual Reviews in Control', 0.6178011894226074], ['Mathematical and Computer Modelling', 0.615909218788147], ['Studies in History and Philosophy of Science Part B: Studies in History and Philosophy of Modern Physics', 0.6036757826805115], ['IFAC-PapersOnLine', 0.6024499535560608], ['Procedia Manufacturing', 0.5946465730667114]]

The random characters get:

[['Journal of the Taiwan Institute of Chemical Engineers', 0.603428304195404], ['Materials Chemistry and Physics', 0.5500685572624207], ['Journal of Materials Science Letters', 0.5484240651130676], ['International Journal of Hydrogen Energy', 0.5217562913894653], ['Applied Thermal Engineering', 0.5169612169265747], ['Materials Letters', 0.5102846026420593], ['Energy', 0.5096801519393921], ['Applied Energy', 0.5080447793006897], ['Chemical Engineering Journal', 0.5075826644897461], ['South African Journal of Chemical Engineering', 0.5050851106643677]]

kevinyang8 commented 5 years ago

I agree we could probably omit the cosine similarity from the results table, but I think we should still display some metric of similarity, because the gap between some of the scores could be very large.

kevinyang8 commented 5 years ago

Also, there's not really a way to enforce that the input is an actual abstract rather than gibberish. With so many classes of journals, it might be the case that regardless of what you put in, the generated document embedding will have a relatively high cosine similarity with some journals. Open to suggestions if anyone has an idea for dealing with this.

computron commented 5 years ago

I just wouldn't call it "confidence". It's not a calibrated confidence interval or anything. Just rename that column to "score" which sounds more arbitrary.

Or, as John suggested, just provide a ranked list. Frankly, the scores are not very meaningful, as Alex pointed out.

computron commented 5 years ago

(Note that the Rester should certainly return the scores; just call them scores and not any kind of confidence. For the web site, I could go either way.)

computron commented 5 years ago

Also, as to why the confidence is high for gibberish, I would guess that this has to do with the use of cosine similarity. You would probably see a good difference in score between gibberish and real text if you switched to Euclidean distance. @kevinyang8 should test this, and there are many ways to refine the distance metric if this seems to be the case.
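A toy sketch of the difference between the two metrics, assuming (untested) that a gibberish embedding could point in a plausible direction but have a very different magnitude from a real abstract. The vectors and helper names are hypothetical, not taken from the matscholar model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; ignores their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def euclidean_distance(u, v):
    """Straight-line distance; sensitive to both direction and length."""
    return math.dist(u, v)

# Made-up 2-D vectors for illustration only.
journal = [3.0, 4.0]
real_like = [2.7, 4.2]   # close to the journal vector in direction and length
weak = [0.3, 0.4]        # same direction as journal, one tenth of the length

cos_real = cosine_similarity(journal, real_like)
cos_weak = cosine_similarity(journal, weak)
dist_real = euclidean_distance(journal, real_like)
dist_weak = euclidean_distance(journal, weak)

print(f"cosine:    real_like={cos_real:.4f}  weak={cos_weak:.4f}")
print(f"euclidean: real_like={dist_real:.4f}  weak={dist_weak:.4f}")
```

In this toy case cosine similarity ranks the short vector above the realistic one, while Euclidean distance ranks it far below. Whether real gibberish embeddings behave this way would still need to be checked against the actual model.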

computron commented 5 years ago

I'd also round the scores to either 0 or 1 decimal places. It's definitely not a precise enough measure to be giving people 2 decimal places.
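One possible way to post-process the results, sketched under assumptions: the input has the list-of-pairs shape shown earlier in this thread, the value is labelled "score" rather than "confidence", and it is shown on a 0 to 100 scale (as in the percentages quoted above) before rounding. `format_suggestions` is a hypothetical helper, not part of the Rester.

```python
def format_suggestions(results, decimals=1):
    """Return a ranked list of dicts with a rounded 'score' field.

    `results` is assumed to be a list of [journal_name, similarity] pairs,
    with similarity on a 0-1 scale.
    """
    ranked = sorted(results, key=lambda pair: pair[1], reverse=True)
    return [
        {"rank": i, "journal": name, "score": round(sim * 100, decimals)}
        for i, (name, sim) in enumerate(ranked, start=1)
    ]

# First three entries from the "nonsense" output posted above.
raw = [
    ["Biometric Technology Today", 0.6736129522323608],
    ["Journal of Materials Science Letters", 0.6650767922401428],
    ["She Ji: The Journal of Design, Economics, and Innovation", 0.6291261911392212],
]

formatted = format_suggestions(raw, decimals=1)
for row in formatted:
    print(row["rank"], row["journal"], row["score"])
```

Dropping the "score" key entirely would give the plain ranked list suggested earlier.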

materialsintelligence / matscholar-web

journal suggestion returns high confidence for gibberish? #164