Prep and provide dataset for use in the project - Githubissues

medvidov / IMaSC

Intelligent Mission and Scientific Instrument Classification. Applying unique NLP approaches to improve information extraction through scientific papers/Foundry A-Team Studies.

Apache License 2.0

Prep and provide dataset for use in the project #7

Closed vc1492a closed 4 years ago

vc1492a commented 4 years ago

This dataset will be provided in a directory titled data and will contain natural-language text from scientific literature about missions and instruments related to Earth and/or space science.

vc1492a commented 4 years ago

Some dataset ideas:

Microwave Limb Sounder (MLS) Publications: this would include both the PDFs and the text extractions of scientific publications that use data from the MLS instrument (which is on board a spacecraft / mission). The set of instruments and missions covered would not be comprehensive, but the data is very rich and is a good place to start.
Earth Science Publications: I have not assembled or researched such a dataset, but a natural extension of the above is to broaden model training to a wider set of Earth science documents, along with the missions and instruments they cover.

vc1492a commented 4 years ago

@medvidov can you check out the data on the dev branch and let me know what you think? We can discuss in more detail early next week!

medvidov commented 4 years ago

@vc1492a I realize we discussed this earlier in the week, but had a quick question: given that more data can't hurt, where can I find the Earth Science Publications (if there is a general collection we could use)?

vc1492a commented 4 years ago

There is a tidy dataset already prepped to use - I had to generate these PDFs and the associated text manually.

There's actually plenty of data in the original ~1200 parsed documents to use - your main bottleneck here will be the pace at which you are able to label data for the training, testing, and holdout datasets.
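
For illustration, here is a minimal sketch of how the ~1200 parsed documents could be partitioned into training, testing, and holdout sets before labeling. Nothing here is specified in this thread: the one-`.txt`-file-per-publication layout under `data`, the split fractions, and the `split_documents` helper are all assumptions.

```python
import random
from pathlib import Path


def split_documents(doc_ids, train_frac=0.7, test_frac=0.15, seed=0):
    """Shuffle document IDs and partition them into train/test/holdout lists.

    The fractions are illustrative assumptions; whatever remains after
    the train and test slices goes to the holdout set.
    """
    ids = sorted(doc_ids)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train_frac)
    n_test = int(len(ids) * test_frac)
    return {
        "train": ids[:n_train],
        "test": ids[n_train:n_train + n_test],
        "holdout": ids[n_train + n_test:],
    }


if __name__ == "__main__":
    # Hypothetical layout: one .txt extraction per publication under data/.
    doc_ids = [p.stem for p in Path("data").glob("*.txt")]
    for name, ids in split_documents(doc_ids).items():
        print(name, len(ids))
```

Fixing the seed keeps the split stable across runs, so a document labeled for the holdout set does not drift into training later.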

medvidov commented 4 years ago

Ok, sounds good. Can you add that data set and I will just note if I don't use it in the end?

vc1492a commented 4 years ago

"Earth Science Publications" was meant more generally - as far as I know, there isn't an existing, available dataset of Earth science publications out there. If you want to expand the MLS dataset you have, you'd need to crawl the web and obtain and parse the PDFs yourself, as we did with the MLS data.

medvidov commented 4 years ago

As discussed during the weekly check-in, 1200 should be enough!