r-three / common-pile

Repo to hold code and track issues for the collection of permissively licensed data
MIT License

Update StackExchange Preprocessing. #52

Closed blester125 closed 9 months ago

blester125 commented 9 months ago

This PR updates how Stack Exchange preprocessing is done.

It adds two new flags. The first, --sort, controls the order in which answers are added to the document. --sort=time orders the answers by the date they were first posted. --sort=votes orders the answers by score, with the "accepted answer" placed first in the list.
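As a rough illustration of the two orderings, here is a minimal sketch. The `Answer` record, its field names, and the `sort_answers` helper are hypothetical and are not taken from the code in this PR; the tie-breaking within `votes` (highest score first after the accepted answer) is also an assumption.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Answer:
    # Hypothetical record; the real preprocessing code may use other names.
    id: int
    created: datetime
    score: int
    accepted: bool = False


def sort_answers(answers, sort="time"):
    """Return a new list of answers ordered according to `sort`."""
    if sort == "time":
        # Oldest answer first.
        return sorted(answers, key=lambda a: a.created)
    if sort == "votes":
        # Accepted answer first (False sorts before True), then by
        # descending score.
        return sorted(answers, key=lambda a: (not a.accepted, -a.score))
    raise ValueError(f"unknown sort mode: {sort!r}")


answers = [
    Answer(id=1, created=datetime(2020, 1, 3), score=5),
    Answer(id=2, created=datetime(2020, 1, 1), score=10),
    Answer(id=3, created=datetime(2020, 1, 2), score=2, accepted=True),
]
by_time = [a.id for a in sort_answers(answers, "time")]
by_votes = [a.id for a in sort_answers(answers, "votes")]
```

With this sample data, `by_time` is `[2, 3, 1]` and `by_votes` is `[3, 2, 1]`: the accepted answer leads even though it has the lowest score.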

This PR also adds a --skip_comments flag, which leaves comments out of the generated document, since Stack Exchange seems to consider comments ephemeral.
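A minimal sketch of what skipping comments could look like when a post is rendered to text. The dictionary layout (`"text"`, `"comments"`) and the `render_post` helper are assumptions made for illustration, not the structures used in this PR.

```python
def render_post(post, skip_comments=False):
    """Render one question or answer, optionally followed by its comments."""
    lines = [post["text"]]
    if not skip_comments:
        # Comments are appended in the order they are stored.
        lines.extend(post.get("comments", []))
    return "\n".join(lines)


post = {"text": "How do I reverse a list?", "comments": ["Which language?", "Python."]}
with_comments = render_post(post)
without_comments = render_post(post, skip_comments=True)
```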

It also adds a new metadata field containing the set of all licenses that appear on the question, answers, and comments that go into a single document. When comments/answers are posted long after the original question, the version of the CC license can change.
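The license metadata could be gathered along these lines. This is a sketch under assumptions: the `"license"` key and the `collect_licenses` helper are hypothetical, and the result is returned as a sorted list so that it serializes deterministically.

```python
def collect_licenses(question, answers, comments=()):
    """Collect the distinct licenses across every post in one document."""
    posts = [question, *answers, *comments]
    # A set removes duplicates; posts without a license value are ignored.
    licenses = {p["license"] for p in posts if p.get("license")}
    return sorted(licenses)


question = {"license": "CC BY-SA 3.0"}
answers = [{"license": "CC BY-SA 3.0"}, {"license": "CC BY-SA 4.0"}]
comments = [{"license": "CC BY-SA 4.0"}, {}]
doc_licenses = collect_licenses(question, answers, comments)
```

Here `doc_licenses` is `["CC BY-SA 3.0", "CC BY-SA 4.0"]`, reflecting a question posted under one license version with later contributions under another.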

These changes are based on discussions from the last meeting.

craffel commented 9 months ago

Thanks!!