mhausenblas / schema-org-rdf

Schema.org in RDF
http://schema.rdfs.org

Cannot scrape Schema.org #55

Open EMegamanu opened 10 years ago

EMegamanu commented 10 years ago

Scraping Schema.org classes and properties into csv files does not work at this time.

I got the following stack trace:

    $> python scrape_csv.py classes.csv properties.csv
    Traceback (most recent call last):
      File "scrape_csv.py", line 12, in <module>
        types = schema_scraper.get_all_types()
      File "/Users/emmanuel/Downloads/schema-org-rdf-master/scrapers/schema_scraper.py", line 20, in get_all_types
        types[id] = get_type_details(base_url + id)
      File "/Users/emmanuel/Downloads/schema-org-rdf-master/scrapers/schema_scraper.py", line 49, in get_type_details
        id = ancestor_links[-1].text_content()
    IndexError: list index out of range

scor commented 10 years ago

There was a massive change in the HTML markup on schema.org.
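If the markup changed, the selector behind `ancestor_links` presumably matches nothing any more, so indexing with `[-1]` fails on an empty list. Below is a minimal sketch of a guard that fails with a clearer message. The helper name `last_ancestor` is hypothetical and is not part of `schema_scraper.py`. It takes an already-extracted list of links rather than lxml elements, so the sketch runs without extra dependencies.

```python
def last_ancestor(ancestor_links, url):
    """Return the last breadcrumb entry, or raise a descriptive error.

    Hypothetical helper: `ancestor_links` stands in for whatever list the
    scraper's selector returned for the page at `url`.
    """
    if not ancestor_links:
        # An empty result usually means the selector no longer matches the page.
        raise ValueError(
            "no ancestor links found at %s; the page markup may have changed" % url
        )
    return ancestor_links[-1]


if __name__ == "__main__":
    print(last_ancestor(["Thing", "Person"], "http://schema.org/Person"))
```

This does not fix the scraper, it only turns a bare `IndexError` into an error that names the page where the selector came back empty.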

http://schema.org/docs/schema_org_rdfa.html is the canonical schema used to generate all the type and property pages on schema.org, maybe you could scrape that one instead if it works for your use case (either scraping the HTML or parsing the RDFa into RDF and generating CSV from there).
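A rough sketch of the "scrape the RDFa page and generate CSV" idea, using only the Python standard library. It rests on several assumptions: that each term sits in an element carrying `typeof` and `resource` attributes, and that labels, comments, and parent types are marked with `property="rdfs:label"`, `property="rdfs:comment"`, and `property="rdfs:subClassOf"`. The `SAMPLE` string is made up to imitate that shape, and the real page may differ. `RdfaTermParser` and `terms_to_csv` are hypothetical names, not part of this repository.

```python
import csv
import io
from html.parser import HTMLParser

# Made-up sample imitating the assumed RDFa markup; the real page may differ.
SAMPLE = """
<div typeof="rdfs:Class" resource="http://schema.org/Thing">
  <span class="h" property="rdfs:label">Thing</span>
  <span property="rdfs:comment">The most generic type of item.</span>
</div>
<div typeof="rdfs:Class" resource="http://schema.org/Person">
  <span class="h" property="rdfs:label">Person</span>
  <span property="rdfs:comment">A person (alive, dead, undead, or fictional).</span>
  <span>Subclass of: <a property="rdfs:subClassOf" href="http://schema.org/Thing">Thing</a></span>
</div>
<div typeof="rdf:Property" resource="http://schema.org/name">
  <span class="h" property="rdfs:label">name</span>
  <span property="rdfs:comment">The name of the item.</span>
</div>
"""


class RdfaTermParser(HTMLParser):
    """Collect one dict per element that has both `typeof` and `resource`."""

    def __init__(self):
        super().__init__()
        self.terms = []
        self._current = None    # term currently being filled in
        self._tag = None        # tag name of the element that opened the term
        self._depth = 0         # nesting depth of that same tag name
        self._field = None      # "label" or "comment" while collecting text
        self._field_tag = None  # tag name that opened the current field

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if self._current is None:
            if "typeof" in a and "resource" in a:
                self._current = {"id": a["resource"], "kind": a["typeof"],
                                 "label": "", "comment": "", "parents": []}
                self._tag, self._depth = tag, 1
            return
        if tag == self._tag:
            self._depth += 1
        prop = a.get("property")
        if prop == "rdfs:subClassOf" and "href" in a:
            self._current["parents"].append(a["href"])
        elif prop in ("rdfs:label", "rdfs:comment"):
            self._field, self._field_tag = prop.split(":")[1], tag

    def handle_data(self, data):
        if self._current is not None and self._field:
            self._current[self._field] += data

    def handle_endtag(self, tag):
        if self._current is None:
            return
        if self._field and tag == self._field_tag:
            self._field = None
        if tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                self.terms.append(self._current)
                self._current = None


def terms_to_csv(terms, kind):
    """Return CSV text (header plus one row per term) for terms of `kind`."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["id", "label", "comment", "parents"])
    for t in terms:
        if t["kind"] == kind:
            writer.writerow([t["id"], t["label"].strip(),
                             t["comment"].strip(), " ".join(t["parents"])])
    return buf.getvalue()


if __name__ == "__main__":
    parser = RdfaTermParser()
    parser.feed(SAMPLE)
    print(terms_to_csv(parser.terms, "rdfs:Class"))
    print(terms_to_csv(parser.terms, "rdf:Property"))
```

To try this against the live page, the downloaded HTML would be passed to `parser.feed(...)` in place of `SAMPLE`. If the output contains only the header row, the attribute assumptions above most likely do not match the actual markup. A full RDFa parser would be the more robust route.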

EMegamanu commented 10 years ago

It seems to work, but the generated files contain only the header line (the column labels).

Did I miss something?