@kevinyang8 @jdagdelen
Out of curiosity, what journals are suggested for the nonsense?
Confidence is the cosine similarity between the document vectors. We can probably omit it from the results table.
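For concreteness, here's a minimal sketch of the kind of ranking this describes. Everything in it is a stand-in (random vectors, a made-up `journal_vecs` dict), not the actual suggester code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: in the real service these would come from the
# trained embedding model, not from random numbers.
journal_vecs = {f"Journal {i}": rng.normal(size=200) for i in range(100)}
query_vec = rng.normal(size=200)  # pretend this is the embedded abstract

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_journals(query_vec, journal_vecs, top_k=10):
    """Rank journals by cosine similarity to the query embedding."""
    scores = [(name, cosine_similarity(query_vec, vec))
              for name, vec in journal_vecs.items()]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:top_k]

print(rank_journals(query_vec, journal_vecs))
```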
The nonsense (yet grammatical) sentences get:
```
[['Biometric Technology Today', 0.6736129522323608],
 ['Journal of Materials Science Letters', 0.6650767922401428],
 ['She Ji: The Journal of Design, Economics, and Innovation', 0.6291261911392212],
 ['Design Studies', 0.6281062960624695],
 ['Applied Mathematics and Computation', 0.62021803855896],
 ['Annual Reviews in Control', 0.6178011894226074],
 ['Mathematical and Computer Modelling', 0.615909218788147],
 ['Studies in History and Philosophy of Science Part B: Studies in History and Philosophy of Modern Physics', 0.6036757826805115],
 ['IFAC-PapersOnLine', 0.6024499535560608],
 ['Procedia Manufacturing', 0.5946465730667114]]
```
The random characters get:

```
[['Journal of the Taiwan Institute of Chemical Engineers', 0.603428304195404],
 ['Materials Chemistry and Physics', 0.5500685572624207],
 ['Journal of Materials Science Letters', 0.5484240651130676],
 ['International Journal of Hydrogen Energy', 0.5217562913894653],
 ['Applied Thermal Engineering', 0.5169612169265747],
 ['Materials Letters', 0.5102846026420593],
 ['Energy', 0.5096801519393921],
 ['Applied Energy', 0.5080447793006897],
 ['Chemical Engineering Journal', 0.5075826644897461],
 ['South African Journal of Chemical Engineering', 0.5050851106643677]]
```
I agree we could probably omit the cosine similarity from the results table, but I think we should still display some metric of similarity, since the gap between entries can be quite large.
Also, there's not really a way to enforce that the input is an actual abstract rather than gibberish. With so many journal classes, it may be that regardless of what you put in, the generated document embedding will have relatively high cosine similarity with some journals. Open to suggestions if anyone has an idea for dealing with this.
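As a toy illustration of that effect (random vectors standing in for real embeddings, with made-up dimensions and counts): even a completely random query lands noticeably closer to *some* journal than to the average one, simply because the maximum over many comparisons is inflated.

```python
import numpy as np

rng = np.random.default_rng(42)
dim, n_journals = 200, 1000

# Random unit vectors standing in for journal centroids and the query.
journals = rng.normal(size=(n_journals, dim))
journals /= np.linalg.norm(journals, axis=1, keepdims=True)
query = rng.normal(size=dim)
query /= np.linalg.norm(query)

sims = journals @ query
print(f"mean similarity: {sims.mean():.3f}")  # roughly 0
print(f"max similarity:  {sims.max():.3f}")   # clearly above the mean
```

Real gibberish embeddings aren't random vectors, so this doesn't explain the ~0.6 scores on its own, but the order-statistics effect points the same way.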
I just wouldn't call it "confidence". It's not a calibrated confidence interval or anything. Just rename that column to "score", which sounds more arbitrary.
Or, as John suggested, just provide a ranked list. Frankly, the scores are not so meaningful, as Alex pointed out.
(Note that the Rester should certainly still return the scores; just call them scores, not any kind of confidence. For the website, I could go either way.)
Also, as to why the confidence is high for gibberish: I would guess this has to do with the use of cosine similarity. You would probably see a clear difference in score between gibberish and real text if you switched to Euclidean distance. @kevinyang8 should test this; there are many ways to refine the distance metric if this turns out to be the case.
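A minimal sketch of why the two metrics can disagree, with made-up 3-d vectors standing in for real embeddings: cosine similarity normalizes away magnitude, so a low-magnitude embedding that happens to point in a typical direction can score as high as a real abstract, while Euclidean distance keeps that magnitude information.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    return float(np.linalg.norm(a - b))

# Made-up 3-d embeddings for illustration only.
journal = np.array([1.0, 1.0, 0.0])      # hypothetical journal centroid
abstract = np.array([0.9, 1.1, 0.1])     # hypothetical real-abstract embedding
gibberish = np.array([0.05, 0.05, 0.0])  # hypothetical low-magnitude gibberish embedding

for name, vec in [("abstract", abstract), ("gibberish", gibberish)]:
    print(f"{name}: cosine={cosine_similarity(journal, vec):.3f}, "
          f"euclidean={euclidean_distance(journal, vec):.3f}")
# abstract:  cosine=0.993, euclidean=0.173
# gibberish: cosine=1.000, euclidean=1.344
```

Whether the real gibberish embeddings actually have unusual magnitudes is exactly what would need testing.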
I'd also round the scores to either 0 or 1 decimal places. It's definitely not a precise enough measure to be giving people 2 decimal places
Maybe I am misunderstanding what the confidence number means, but...
If I try a real abstract:
The top results have ~57% confidence.
If I try some nonsense:
The top results have 66% confidence.
If I try some random characters:
The top results have ~60% confidence.
How can nonsense text produce journal suggestions with greater confidence than actual abstracts? What does the confidence number actually mean?