squaresLab / ros-answers-miner

A web scraper for ROS Answers
Apache License 2.0

Why scrape? #8

Open gavanderhoorn opened 1 year ago

gavanderhoorn commented 1 year ago

Hi guys. Interesting project.

I was curious as to why you're using web scraping to get ROS Answers content? IIRC, there is support for exporting/dumping the database (using a web API) in a relatively usable format. That would seem to allow more convenient processing of it.

The dump / API access was used by @DLu to create the ROS Answers section of metrics.ros.org (source).

Perhaps he could say something as to whether that could also be made available for scientific research purposes.

DLu commented 1 year ago

Gladly: http://metrorobots.com/answers.db

gavanderhoorn commented 1 year ago

@DLu: does that also contain Q&A content? 170MB seems small for the entirety of ROS Answers?


Edit: looks like it does.

DLu commented 1 year ago

Sorta, the database structure is here: https://github.com/DLu/ros_metrics/blob/main/data/answers.yaml

The question title/summary is included.

The answer text is not.
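To check which tables and columns a local copy of the dump actually contains, something along these lines might help. It is only a sketch using Python's standard `sqlite3` module; the helper name `describe_tables` is made up for illustration, and it assumes `answers.db` is an ordinary SQLite file.

```python
import sqlite3


def describe_tables(conn):
    """Return {table_name: [column_name, ...]} for every table in the database."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    schema = {}
    for table in tables:
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        # The table names come from sqlite_master itself, so quoting them suffices.
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = [col[1] for col in columns]
    return schema
```

With the file downloaded to the working directory, `describe_tables(sqlite3.connect("answers.db"))` should list what is really in the dump, which can then be compared against answers.yaml.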

gavanderhoorn commented 1 year ago

> The answer text is not.

ah, hm.

So that might still need scraping then.

Would you know of a way to retrieve the answer bodies as well, without scraping? This must exist, right?

DLu commented 1 year ago

https://github.com/ASKBOT/askbot-devel/pull/828

Been there, done that.

DLu commented 1 year ago

https://answers.ros.org/api/v1/answers/13122/
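A minimal sketch of fetching that endpoint from Python with only the standard library. The URL pattern is taken from the link above; the helper names `api_url` and `fetch_json` are hypothetical, and nothing is assumed about which fields the returned JSON contains.

```python
import json
import urllib.request

API_ROOT = "https://answers.ros.org/api/v1"  # base of the URL quoted above


def api_url(kind, item_id):
    """Build an API URL such as .../answers/13122/ (kind is the resource name)."""
    return f"{API_ROOT}/{kind}/{int(item_id)}/"


def fetch_json(url, opener=urllib.request.urlopen, timeout=30):
    """GET the URL and decode the body as JSON; `opener` is injectable for testing."""
    with opener(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `fetch_json(api_url("answers", 13122))` would request the answer linked above. Printing the keys of the result is probably the quickest way to see what the API exposes.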

pcanelas commented 1 year ago

Hello everyone,

I had no idea that this API existed, thank you so much @gavanderhoorn and @DLu!

@DLu I was wondering, I noticed in the database structure that it provides a summary of the question content and not the entire content of the question, and also the comments seem to be missing. Is it possible to also obtain this information using the API?

gavanderhoorn commented 1 year ago

@DLu wrote:

Gladly: http://metrorobots.com/answers.db

@DLu: when was that .db created/copied/downloaded? I'm trying some toy SQL queries and can't get them to return the same numbers answers.ros.org shows.

Either my SQL is incorrect (very much possible) or the .db is not up-to-date?

DLu commented 1 year ago

> @DLu I was wondering, I noticed in the database structure that it provides a summary of the question content and not the entire content of the question

I think the field is just named summary, but it's actually the whole text. See https://answers.ros.org/api/v1/questions/408502/

> and also the comments seem to be missing. Is it possible to also obtain this information using the API?

Last I checked, no.

> @DLu: when was that .db created/copied/downloaded? I'm trying some toy SQL queries and can't get them to return the same numbers answers.ros.org shows.

I would have guessed the beginning of April. How off are the numbers you're getting?

gavanderhoorn commented 1 year ago

Somewhat off-topic perhaps, but the following query (5184 is my user id):

select id from answers where user_id == 5184

returns 3479 for me. ROS Answers says (as of today) 3517.

I also can't get the total karma to match what ROS Answers shows, but that's not really important.
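The query above returns one row per answer, so the figure quoted is the number of rows. A sketch of getting the count directly from Python's `sqlite3`, assuming the same `answers` table and `user_id` column as in that query (the function name is only illustrative):

```python
import sqlite3


def count_answers_by_user(conn, user_id):
    """Count rows in the `answers` table that belong to one user."""
    row = conn.execute(
        # A bound parameter avoids pasting the id into the SQL string.
        "SELECT COUNT(*) FROM answers WHERE user_id = ?", (user_id,)
    ).fetchone()
    return row[0]
```

Called as `count_answers_by_user(sqlite3.connect("answers.db"), 5184)`, this should give the same number as counting the rows of the original query.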

DLu commented 1 year ago

My local copy says 3506, so it doesn't seem that far off. I'll believe that you have posted 11 answers since I updated the database.