hannahshumway / vt-topicmodeling

Developing a topic modeling process for understanding the literature on The Prospects for Artificial Intelligence in Urban Planning.

"et al" and some proper names still showing up in text collocations #3

Open hannahshumway opened 2 years ago

hannahshumway commented 2 years ago

Should we manually remove some of these as stopwords or is there a more general way to go about it? Fixing some of the other issues here might make this issue less pressing (but not necessarily given all of the in-text citations).

ggordn3r commented 2 years ago

I like the idea of adding "et al" and last names as our own dictionary of stopwords generated from the references. Seems straightforward, reliable, and clean since the API is already returning structured JSON.

ggordn3r commented 2 years ago

Update: Let me make sure I understand the question and context. In the current iteration, ReferenceChecker only removes full references and leaves the in-text citations untouched, right? So is this question about how to clean up in-text citations efficiently?

hannahshumway commented 2 years ago

Yes, that's correct. Assuming we can mitigate the separate issue of the few references that aren't extracting correctly, we'd still miss the 'et al' and common names from in-text citations. So, I agree it'd be useful to create our own extra list of stopwords. Any ideas on how best to do that with proper names (or if name stopword databases exist somewhere)?

hannahshumway commented 2 years ago

Actually, the question of removing names just reminded me of scrubadub, which might do the trick here.

ggordn3r commented 2 years ago

Good idea. Try scrubadub and if it does a good job of removing the names, then we're basically done. Just have to make sure years, "et al", and parentheses don't survive the cleaning.
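A rough sketch of that last check, as a regex pass that could run after the name scrubbing. The function name `strip_citation_residue` and the patterns are illustrative assumptions, not anything in the repo yet. Note that it drops every standalone four-digit year starting with 19 or 20, including years that appear in ordinary prose.

```python
import re

# Parenthetical groups that contain a four-digit year, e.g. "(Smith et al., 2019)".
PAREN_CITATION = re.compile(r"\([^()]*\b(?:19|20)\d{2}[a-z]?\b[^()]*\)")
# Leftover "et al" or "et al." from narrative citations.
ET_AL = re.compile(r"\bet\s+al\b\.?", flags=re.IGNORECASE)
# Standalone years such as 2019 or 2019a.
YEAR = re.compile(r"\b(?:19|20)\d{2}[a-z]?\b")


def strip_citation_residue(text: str) -> str:
    """Remove parenthetical citations, 'et al', years, and stray parentheses."""
    text = PAREN_CITATION.sub(" ", text)
    text = ET_AL.sub(" ", text)
    text = YEAR.sub(" ", text)
    text = re.sub(r"[()]", " ", text)          # any parentheses that survived
    return re.sub(r"\s+", " ", text).strip()   # collapse the gaps left behind


if __name__ == "__main__":
    sample = "Planners use AI (Smith et al., 2019) widely. Jones et al. 2020 agree."
    print(strip_citation_residue(sample))
```

Surnames in narrative citations (like "Jones" in the sample) are left alone here on purpose, since that part would be scrubadub's job or the stopword list's.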

If not, we could possibly write a function that extracts last names from the Scholarcy references, organizes them into a list, and appends that to the existing list of English stopwords.
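A minimal sketch of what that function might look like. It assumes the references are already available as plain strings in an APA-like "Surname, I., & Surname, I. (year)" form; the actual shape of the Scholarcy output may differ, and `surnames_from_references` and `extend_stopwords` are hypothetical names. The pattern only covers ASCII surnames.

```python
import re

# Assumption: authors are written "Surname, I." or "Surname, I. I." (APA-like).
SURNAME = re.compile(r"([A-Z][A-Za-z'\-]+),\s*(?:[A-Z]\.\s*)+")


def surnames_from_references(references):
    """Collect lowercased author surnames from a list of reference strings."""
    names = set()
    for ref in references:
        # Only inspect the author segment that precedes "(year".
        author_part = re.split(r"\(\d{4}", ref, maxsplit=1)[0]
        for match in SURNAME.finditer(author_part):
            names.add(match.group(1).lower())
    return names


def extend_stopwords(base_stopwords, references):
    """Return the base stopwords plus surnames plus the 'et al' tokens."""
    return set(base_stopwords) | surnames_from_references(references) | {"et", "al"}


if __name__ == "__main__":
    refs = [
        "Smith, J., & Jones, A. B. (2019). Planning with AI. Journal of X.",
        "O'Brien, K. (2021). Cities.",
    ]
    print(sorted(extend_stopwords({"the", "and"}, refs)))
```

The result is lowercased, so it only works if the corpus tokens are lowercased before stopword filtering too.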

If we know the paper's reference style, we could also generate the in-text citations programmatically (e.g. by copying code from an open-source reference manager or bib generator) and extend ReferenceChecker to delete those as well, but that would exacerbate the same efficiency issues mentioned in #2.
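For illustration, here is a sketch of the generate-and-delete idea under the assumption of an APA-like author-year style. It is standalone and does not reflect how ReferenceChecker actually works; `intext_variants` and `remove_intext_citations` are hypothetical names, and references are assumed to be already parsed into `(surnames, year)` pairs.

```python
import re


def intext_variants(surnames, year):
    """Build the in-text citation strings one reference could produce."""
    if len(surnames) == 1:
        leads = [surnames[0]]
    elif len(surnames) == 2:
        leads = [f"{surnames[0]} & {surnames[1]}", f"{surnames[0]} and {surnames[1]}"]
    else:
        leads = [f"{surnames[0]} et al."]
    variants = []
    for lead in leads:
        variants.append(f"({lead}, {year})")  # parenthetical form
        variants.append(f"{lead} ({year})")   # narrative form
    return variants


def remove_intext_citations(text, references):
    """Delete every generated citation string from the text."""
    variants = [v for surnames, year in references
                for v in intext_variants(surnames, year)]
    # Longest first, so a longer citation is never partially eaten by a shorter one.
    variants.sort(key=len, reverse=True)
    for variant in variants:
        text = text.replace(variant, " ")
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    refs = [(["Smith", "Jones", "Wu"], 2019), (["Lee", "Park"], 2020)]
    body = "AI helps planners (Smith et al., 2019). Lee and Park (2020) disagree."
    print(remove_intext_citations(body, refs))
```

This does one pass over the text per variant, which is where the efficiency cost comes from. It also misses grouped citations such as "(Smith, 2019; Jones, 2020)", so the regex or stopword route may still be needed as a fallback.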