jakelever / biotext

Get a nicely-chunked local copy of the biomedical literature (to use for other projects)!
MIT License

Keep Citation Information as Annotations #3

Closed creisle closed 2 years ago

creisle commented 2 years ago

As discussed offline, it would be useful to be able to keep the in-text citation information as annotations in BioC format. I've had a crack at this as an offshoot of my tables PR #2, since it lays some groundwork that helps. I made this ticket for discussing the particulars.

creisle commented 2 years ago

Thus far I've been able to keep citations by stripping out the actual citation and then attaching the annotation to the preceding non-whitespace token. Originally I was thinking it should apply to the whole preceding sentence, but that has two disadvantages:

  1. We'd need to include a language model to split sentences, or make assumptions about periods.
  2. Sometimes the citation might not apply to the entire preceding sentence, or it may apply to more than one sentence, so this would add more assumptions.

What do you think? Does it make sense to just attach it to the token right before the citation?
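To make the token-anchoring idea concrete, here is a minimal sketch in plain Python. It is not the code from the PR: it assumes citations appear as bracketed numeric markers such as `[1]` or `[2, 3]` in plain text (the real input is presumably XML with reference elements), and the names `CITATION`, `strip_citations`, and the annotation dict keys are all hypothetical. It does not use the bioc library; the offsets it produces are simply the kind of location information an annotation would need.

```python
import re

# Assumption: citations look like "[1]" or "[2, 3]". Whitespace before the
# marker is removed together with the marker itself.
CITATION = re.compile(r"\s*\[(\d+(?:\s*,\s*\d+)*)\]")


def strip_citations(text):
    """Strip citation markers and anchor each one to the preceding token.

    Returns (clean_text, annotations). Each annotation is a dict holding the
    offset and length of the anchor token within clean_text, the token text,
    and the list of cited reference ids.
    """
    clean_parts = []
    annotations = []
    pos = 0
    for match in CITATION.finditer(text):
        # Keep everything between the previous citation and this one.
        clean_parts.append(text[pos:match.start()])
        pos = match.end()

        # The anchor is the last non-whitespace token kept so far.
        kept = "".join(clean_parts).rstrip()
        if not kept:
            continue  # citation at the very start: nothing to anchor to
        token = kept.split()[-1]
        annotations.append({
            "offset": len(kept) - len(token),
            "length": len(token),
            "text": token,
            "ref_ids": re.split(r"\s*,\s*", match.group(1)),
        })
    clean_parts.append(text[pos:])
    return "".join(clean_parts), annotations


example = "BRAF mutations drive melanoma [1, 2]. Other genes matter too [3]."
clean, anns = strip_citations(example)
```

In this sketch, trailing punctuation that directly precedes a marker (as in `melanoma.[1]`) stays part of the anchor token, because the token is defined purely by whitespace. No sentence splitting is needed, which is the main appeal over annotating the whole preceding sentence.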
