Closed. sigven closed this issue 3 years ago.
Hi Sigve, yes, I made some changes to how the updates are done but it looks like it's had some unexpected consequences. I'm digging into it to see what needs to be fixed. Thanks for raising it.
Hey, just a little update. I've been looking into this and have found a few issues. The most important one, which thanks to you I found, is that my new updating method failed to process the PubMed Central Author Manuscript Collection properly. That's a smaller subset of PubMed Central, but still a large number of papers. So thank you!
I'm working on a new update that fixes that and a few other smaller issues. Some other sentences will still disappear in the new update, and I'm working through them to make sure the reasons are acceptable. Because the underlying parsing and ML libraries and the ontologies all get updated, we won't always get exactly the same results from one release to the next. The main reason sentences drop out is that the ML system now scores them lower (due to changes in scikit-learn), which puts them below the cutoffs we use to decide that something is confident enough to be called.
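To illustrate the cutoff effect, here is a minimal sketch, not the actual CancerMine pipeline. The sentence IDs, scores, and the 0.5 cutoff are all made up for the example; it only shows how a small shift in a classifier score can move a sentence from "called" to "dropped".

```python
def confident_calls(scored_sentences, cutoff):
    """Return the IDs of sentences whose score meets the cutoff."""
    return {sid for sid, score in scored_sentences.items() if score >= cutoff}


# Hypothetical scores for the same three sentences under two library versions.
old_scores = {"s1": 0.91, "s2": 0.52, "s3": 0.30}
new_scores = {"s1": 0.90, "s2": 0.48, "s3": 0.31}

CUTOFF = 0.5  # illustrative value, not the project's real threshold

# Sentences that were called before but fall below the cutoff now.
dropped = confident_calls(old_scores, CUTOFF) - confident_calls(new_scores, CUTOFF)
print(sorted(dropped))
```

Here "s2" moves by only 0.04 but crosses the cutoff, so it disappears from the output even though nothing about the sentence itself changed.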
I hope to have the new update out in the next week or two.
Thanks a lot for looking into this. Looking forward to the update. Great work!
Regards, Sigve
Hey, got a new version up. And I dug into changes to make sure they were okay.
Briefly, for sentences that appear in v28 and not in v30:
Again, thanks for flagging the issue.
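For anyone wanting to run a similar check between two releases, a rough sketch follows. It assumes each release is a tab-separated file with a header row; the column names `pmid` and `sentence` and the sample rows are placeholders for the example, not necessarily the real CancerMine columns.

```python
import csv
import io


def load_keys(tsv_text, key_columns):
    """Read TSV text and return the set of key tuples, one per row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {tuple(row[col] for col in key_columns) for row in reader}


# Tiny stand-ins for two releases; real files would be read from disk.
old_release = (
    "pmid\tsentence\n"
    "1\tGene A drives cancer X.\n"
    "2\tGene B suppresses cancer Y.\n"
)
new_release = (
    "pmid\tsentence\n"
    "1\tGene A drives cancer X.\n"
)

KEY = ("pmid", "sentence")  # hypothetical key columns

# Rows present in the older release but absent from the newer one.
missing = load_keys(old_release, KEY) - load_keys(new_release, KEY)
for row in sorted(missing):
    print(row)
```

Swapping the two arguments of the set difference lists rows that are new in the later release instead.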
Cool, will check out tomorrow! Thanks again for a solid piece of work.
Regards, Sigve
Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Hi,
Quick question regarding the October 2020 data update. I notice a dramatic increase in the size of cancermine_unfiltered.tsv, and a drop in the size of cancermine_collated.tsv, which does not follow the trends in previous releases. Is this as expected?
Kind regards, Sigve