jakelever / cancermine

Text-mined knowledgebase for drivers, oncogenes and tumor suppressors in cancer
http://bionlp.bcgsc.ca/cancermine/
MIT License

Data update v29 - increase in unfiltered and drop in sentences #2

Closed by sigven 3 years ago

sigven commented 4 years ago

Hi,

Quick question regarding the October 2020 data update. I notice a dramatic increase in the size of cancermine_unfiltered.tsv and a drop in cancermine_collated.tsv, neither of which follows the trend of previous releases. Is this expected?
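A comparison like the one described above could be sketched as follows. This is only an illustration: the file paths are hypothetical, and it assumes each release file is tab-separated with a single header row, which the thread does not confirm.

```python
import csv


def count_data_rows(path):
    """Count the non-header rows of a tab-separated file."""
    with open(path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE: treat quote characters inside sentences as plain text.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        next(reader, None)  # assumption: the first row is a header
        return sum(1 for _ in reader)


def relative_change(old, new):
    """Fractional change from old to new; None when old is zero."""
    if old == 0:
        return None
    return (new - old) / old


def compare_releases(old_path, new_path):
    """Return (old row count, new row count, fractional change)."""
    old_rows = count_data_rows(old_path)
    new_rows = count_data_rows(new_path)
    return old_rows, new_rows, relative_change(old_rows, new_rows)


# Hypothetical usage; the directory layout is an assumption:
# compare_releases("v28/cancermine_unfiltered.tsv", "v29/cancermine_unfiltered.tsv")
```

Running this for each file across several consecutive releases would show whether one release departs from the earlier trend.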

Kind regards, Sigve

jakelever commented 4 years ago

Hi Sigve, yes, I made some changes to how the updates are done, but it looks like they've had some unexpected consequences. I'm digging into it to see what needs to be fixed. Thanks for raising it.

jakelever commented 4 years ago

Hey, just a little update. I've been looking into this and have found a few issues. The most important one, which thanks to you I found, is that my new updating method failed to process the PubMed Central Author Manuscript Collection properly. That's a smaller subset of PubMed Central, but still a large number of papers. So thank you!

I'm working on a new update that fixes that and a few other smaller issues. Other sentences will disappear from the new update, and I'm working through them to make sure that the reasons are okay. Because the underlying parsing and ML libraries and the ontologies all get updated, we won't always get exactly the same results from one release to the next. The main reason that sentences may drop out is that the ML system scores them lower (due to changes to scikit-learn), which puts them below the cutoffs that we've used to decide something is confident enough to be called.
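To illustrate the cutoff effect described above, here is a minimal sketch in plain Python. The scores and the cutoff of 0.7 are made up for the example, and the function names are hypothetical; none of this is taken from the CancerMine code.

```python
def kept_sentences(scored, cutoff):
    """Return the set of sentence IDs whose score meets the cutoff."""
    return {sentence_id for sentence_id, score in scored if score >= cutoff}


def dropped_between(old_scored, new_scored, cutoff):
    """Sentence IDs that passed the cutoff before but no longer do."""
    old_kept = kept_sentences(old_scored, cutoff)
    new_kept = kept_sentences(new_scored, cutoff)
    return sorted(old_kept - new_kept)


# Illustrative values only: the same two sentences scored by two
# classifier versions, with a small shift in the second score.
old_scores = [("s1", 0.90), ("s2", 0.72)]
new_scores = [("s1", 0.91), ("s2", 0.68)]
print(dropped_between(old_scores, new_scores, cutoff=0.7))
```

A small shift in a score near the cutoff is enough to remove a sentence from the filtered output, even though the underlying text has not changed.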

I hope to have the new update out within the next week or two.

sigven commented 4 years ago

Thanks a lot for looking into this. Looking forward to the update. Great work!

Regards, Sigve

jakelever commented 3 years ago

Hey, I've got a new version up, and I dug into the changes to make sure they were okay.

Briefly, for sentences that appear in v28 and not in v30:

Again, thanks for flagging the issue.

sigven commented 3 years ago

Cool, will check it out tomorrow! Thanks again for a solid piece of work.

Regards, Sigve

stale[bot] commented 3 years ago

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.