Teaching a Computer to Read Science

Imperssonator commented 8 years ago

Me

Nils Persson

Abstract

People in all scientific disciplines spend an incredible amount of time reading and preparing published figures and graphs. Somewhere between 60% and 80% of all published scientific data is trapped in a .jpg file or a PDF somewhere on a publisher's server. We live in an era of learning from big data, yet arguably the biggest, most validated, most scrutinized data collected by humankind is hiding in images. Automated extraction of data from images of figures and graphs has received a smattering of attention over the past decades, but the time has never been more ripe for this technology to come to the fore. With a dataset of hundreds of figures from open-access journals, how much information can we extract armed with just a Jupyter Notebook and computer vision libraries? The answer: quite a lot.

Affiliation

Georgia Institute of Technology

About Me

Nils is a chemical engineering PhD student at the Georgia Institute of Technology whose thesis could best be compared to a wayward fishing vessel lost in the middle of the Pacific. It will eventually beach on a deserted island, which will be deemed novel regardless of others having discovered it before.

bollwyvl commented 8 years ago

@Imperssonator Loved the preview of this at PyData Atlanta! While obviously a technical topic, I think the impact of this work (perhaps with some librarification) will be very practical for a lot of attendees.

Do you see this as a standalone evolution of the pydatatl lightning talk, or do you want to figure out how to bring this into a more "applied" context? We haven't fixed the PM workshop topics yet, but I could see this being very applicable when applied to some non-traditional datasets... thinking "data journalism", "public health & medicine" or some other stuff that would push your hypothesis a little outside your comfort zone.

Imperssonator commented 8 years ago

Yeah, it is loosely libraried right now but that would be a good goal for bringing it to an audience. I'd be interested in applying it to some other random datasets, too. I don't know what the workshops will look like but this could make for a good one - we have a large dataset that people could work on. I think there are a lot of general image processing strategies that I could introduce people to in a workshop setting.

Imperssonator commented 8 years ago

For a workshop, I would probably go like this:

Image data structures

Thresholding and Binarization methods

Operations on binary images:

Dilation, erosion
Connected component analysis
Skeletonization

That right there is a solid hour of material on morphological image processing.
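
As a rough illustration of the first few outline items, here is a minimal, dependency-free sketch of thresholding, dilation, erosion, and connected component labeling on a tiny grayscale image stored as a list of rows. It is not the workshop material itself: the function names, the 4-connected neighborhood, the cutoff of 128, and the sample image are all assumptions made for this example. In practice a library such as scikit-image or `scipy.ndimage` would likely do this work. Skeletonization is left out to keep the sketch short.

```python
from collections import deque

# 4-connected neighborhood offsets (row, col). This is an assumption for the
# sketch; 8-connectivity is an equally common choice.
NEIGHBORS = [(-1, 0), (1, 0), (0, -1), (0, 1)]


def threshold(gray, cutoff):
    """Binarize a grayscale image: True wherever a pixel exceeds the cutoff."""
    return [[pixel > cutoff for pixel in row] for row in gray]


def _neighborhood(binary, r, c):
    """Yield the pixel at (r, c) and its in-bounds 4-neighbors."""
    rows, cols = len(binary), len(binary[0])
    yield binary[r][c]
    for dr, dc in NEIGHBORS:
        rr, cc = r + dr, c + dc
        if 0 <= rr < rows and 0 <= cc < cols:
            yield binary[rr][cc]


def dilate(binary):
    """A pixel becomes foreground if it or any 4-neighbor is foreground."""
    return [[any(_neighborhood(binary, r, c)) for c in range(len(binary[0]))]
            for r in range(len(binary))]


def erode(binary):
    """A pixel stays foreground only if it and all in-bounds 4-neighbors are.

    Out-of-bounds neighbors are simply ignored, so the image border does not
    erode foreground on its own.
    """
    return [[all(_neighborhood(binary, r, c)) for c in range(len(binary[0]))]
            for r in range(len(binary))]


def label_components(binary):
    """Label 4-connected foreground regions with 1, 2, ... via breadth-first search."""
    rows, cols = len(binary), len(binary[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if not binary[r][c] or labels[r][c]:
                continue  # background, or already assigned to a component
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:
                cr, cc = queue.popleft()
                for dr, dc in NEIGHBORS:
                    nr, nc = cr + dr, cc + dc
                    if (0 <= nr < rows and 0 <= nc < cols
                            and binary[nr][nc] and not labels[nr][nc]):
                        labels[nr][nc] = count
                        queue.append((nr, nc))
    return labels, count


# Hypothetical 5x6 image: a bright 2x2 square and a bright 2x1 bar on a dark background.
gray = [
    [10, 10, 10, 10, 10, 10],
    [10, 200, 200, 10, 10, 10],
    [10, 200, 200, 10, 220, 10],
    [10, 10, 10, 10, 220, 10],
    [10, 10, 10, 10, 10, 10],
]

binary = threshold(gray, 128)
labels, count = label_components(binary)
print("components before dilation:", count)
print("components after dilation:", label_components(dilate(binary))[1])
```

Comparing the component count before and after dilation shows how morphological operations change connectivity: the two shapes sit one pixel apart, so dilation bridges the gap between them.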

Imperssonator commented 8 years ago

And that's not including the initial half hour of getting PIL/Pillow correctly installed on everyone's machine... what a nightmare...

tonyfast commented 8 years ago

alright, @Imperssonator: we've got you in the Notebooks for Science workshop. Stay tuned for what that means!

jupyterday-atlanta-2016 / proposals