hglanz / MetadataRepository_Summer2021

GNU General Public License v3.0
Summer Goal #1: Gather Metadata from Data Repositories #2

Open hglanz opened 3 years ago

hglanz commented 3 years ago

1) Learn and document how to access and obtain metadata for each of the data repositories
2) Write and contribute code to the repo for obtaining metadata

UCI Machine Learning Repository - done
CORGIs - done

kingsuching commented 3 years ago

Hello Dr. Glanz,

I have written code to scrape any dataset from the datahub.io repository. The .R file is in the main branch, and the scraped data is in the Data folder. Thanks!

Sucheen

kingsuching commented 3 years ago

Hello Dr. Glanz,

I have made progress on writing code to scrape datasets from the specified repositories, but I am having trouble iterating through an entire repository to obtain all the metadata. The functions I have written can extract the metadata from a single link, but I have not found a way to get the list of dataset links to iterate over. Can you please point me in the right direction?

Thanks, Sucheen S

hglanz commented 3 years ago

Unfortunately, it's going to vary from repository to repository. I'm going to use the CDC as an example, though.

Step 1) Go to https://data.cdc.gov/
Step 2) Select a topic area (e.g. National Center for Health Statistics)
Step 3) Select "Datasets" in the "View Types" panel on the left
Step 4) Now we're on a page with a list of dataset page links.

A couple things to notice:

This process will vary and depend on the structure of each data repository, unfortunately. Let me know if this doesn't make sense or if you have any questions :)
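
The steps above could be sketched in R roughly as follows. This is only an illustration under stated assumptions: the HTML is a made-up stand-in for a listing page (not the real data.cdc.gov markup), the selector a.dataset-link is hypothetical, and scrape_one() is a placeholder for whatever single-link function already exists. It parses an inline string so it runs without network access.

```r
library(rvest)

# Stand-in for a listing page. Hypothetical markup, not copied from data.cdc.gov.
listing_html <- '<html><body>
  <a class="dataset-link" href="/d/abcd-1234">Dataset A</a>
  <a class="dataset-link" href="/d/efgh-5678">Dataset B</a>
  <a class="other" href="/about">About</a>
</body></html>'

# Collect the href of every link matching `selector` on one listing page
# and resolve relative paths against the site's base URL.
get_dataset_links <- function(page, base_url, selector = "a.dataset-link") {
  hrefs <- page %>% html_nodes(selector) %>% html_attr("href")
  xml2::url_absolute(hrefs, base_url)
}

page <- read_html(listing_html)
links <- get_dataset_links(page, "https://data.cdc.gov/")

# Placeholder for the existing function that scrapes a single dataset page.
scrape_one <- function(url) data.frame(url = url, stringsAsFactors = FALSE)

# Apply the single-link function to every link and stack the results.
metadata <- do.call(rbind, lapply(links, scrape_one))
```

For a real repository, read_html() would be given the listing page's URL instead of a string, and the selector would have to be found by inspecting that page's source.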

kingsuching commented 3 years ago

Thank you for your feedback. I will try that soon.

kingsuching commented 3 years ago

Hey Dr. Glanz, I was writing code to scrape metadata from data.cdc.gov. I used rvest's read_html(url) %>% html_nodes(css_selector) %>% html_text pipeline, but it did not return any of the scraped data. I tried this on different pages and with different CSS selectors. Can you please point me in the right direction? Thanks!
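
A minimal sketch of that pipeline, ending in html_text(), with one diagnostic added: it reports how many nodes each selector matched. A count of zero means either the selector does not fit the page or the content is not in the HTML that read_html() receives. One possible reason for the latter, offered here only as an assumption, is that a script fills the content in later in the browser. The page below is a made-up inline example, and scrape_text() is a hypothetical helper name.

```r
library(rvest)

# Hypothetical page: the metadata container is empty in the raw HTML, as it
# would be if a script populated it after the page loaded in a browser.
raw_html <- '<html><body>
  <h1 class="title">Example dataset</h1>
  <div id="metadata"></div>
</body></html>'

page <- read_html(raw_html)

# Return the text for a CSS selector and report how many nodes it matched.
scrape_text <- function(page, css_selector) {
  nodes <- page %>% html_nodes(css_selector)
  message(length(nodes), " node(s) matched '", css_selector, "'")
  nodes %>% html_text(trim = TRUE)
}

title <- scrape_text(page, "h1.title")    # one match, text is returned
rows <- scrape_text(page, "#metadata td") # no matches, empty character vector
```

If the count is zero for a selector that works in the browser's inspector, comparing it against the page source (View Source rather than the inspector) shows whether the elements are present before any scripts run.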