vc1492a closed 4 years ago
Some dataset ideas:
@medvidov can you check out the data on the dev branch and let me know what you think? We can discuss in more detail early next week!
@vc1492a I realize we discussed this earlier in the week, but had a quick question: given that more data can't hurt, where can I find the Earth Science Publications (if there is a general collection we could use)?
There is a tidy dataset already prepped to use - these PDFs and the associated text had to be manually generated on my part.
There's actually plenty of data in the original ~1200 parsed documents to use - your main bottleneck here will be the pace at which you are able to label data for the training, testing, and holdout datasets.
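For reference, here is a minimal sketch of how the labeled documents might be divided once labeling is done. The 70/15/15 proportions, the `split_documents` helper, and the placeholder document IDs are assumptions for illustration only, not something settled in this thread.

```python
import random


def split_documents(doc_ids, train_pct=70, test_pct=15, seed=0):
    """Shuffle document IDs and split them into train, test, and holdout lists.

    The percentages are illustrative assumptions; whatever is left after
    the train and test portions goes to the holdout set.
    """
    if train_pct < 0 or test_pct < 0 or train_pct + test_pct > 100:
        raise ValueError("train_pct and test_pct must be non-negative and sum to at most 100")

    ids = list(doc_ids)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(ids)

    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = len(ids) * train_pct // 100
    n_test = len(ids) * test_pct // 100

    train = ids[:n_train]
    test = ids[n_train:n_train + n_test]
    holdout = ids[n_train + n_test:]
    return train, test, holdout


# Placeholder IDs standing in for the ~1200 parsed documents.
doc_ids = [f"doc_{i:04d}" for i in range(1200)]
train, test, holdout = split_documents(doc_ids)
print(len(train), len(test), len(holdout))
```

Fixing the split before labeling starts would let the holdout documents stay untouched until final evaluation.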
Ok, sounds good. Can you add that data set and I will just note if I don't use it in the end?
"Earth Science Publications" was meant more generally - as far as I know, there isn't a publicly available dataset of Earth science publications. If you want to expand the MLS dataset you have, you'd need to crawl the web, then obtain and parse the PDFs yourself, as we did with the MLS data.
As discussed during the weekly check-in, 1200 should be enough!
This dataset will be provided in a directory titled `data` and will contain textual, natural language data from scientific literature that relates to missions and instruments in Earth and/or space science.