pushshift / api

Pushshift API

selftext_html? #8

Open Zuur opened 7 years ago

Zuur commented 7 years ago

There are a few fields missing from the submission results - selftext_html, created, ups, downs, etc. I'm curious why these fields aren't available but I'm after selftext_html in particular. If it's not available at all can you suggest the easiest way to produce it?

pushshift commented 7 years ago

Hi Aaron!

Are you talking about the online API or the Reddit dumps? The ups and downs fields aren't included because Reddit masks downvotes and ups is the same as score, so those two fields are either useless or redundant. The created field is the same as created_utc. The selftext_html field is redundant because selftext has all the info: selftext_html was just the same text marked up as HTML, which you can reproduce using any of the available markup libraries (for example https://github.com/gamefreak/snuownd).
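For readers who still depend on the old field names, the two scalar aliases described in this reply can be restored client-side. This is a minimal sketch that takes the reply at face value (ups equals score, created equals created_utc); the function name `restore_aliases` and the sample record are hypothetical and not part of the API.

```python
from typing import Any, Dict


def restore_aliases(submission: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a submission dict with `ups` and `created` filled in."""
    restored = dict(submission)  # shallow copy; the input is left untouched
    if "score" in restored:
        # Per the reply above: ups carries the same value as score.
        restored.setdefault("ups", restored["score"])
    if "created_utc" in restored:
        # Per the reply above: created is treated as equal to created_utc.
        restored.setdefault("created", restored["created_utc"])
    return restored


# Hypothetical record, shaped like an API result for illustration only.
example = {
    "id": "abc123",
    "score": 42,
    "created_utc": 1509235200,
    "selftext": "Hello *world*",
}
restored = restore_aliases(example)
```

Because `setdefault` is used, a record that already carries `ups` or `created` keeps its own values.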

Thanks!

-- Jason Michael Baumgartner pushshift.io http://pushshift.io

Zuur commented 7 years ago

Hi Jason, thanks for the quick response!

I'm using the api.

I've always seen created come out about 13 hours ahead of created_utc. I previously used created in my code (mainly because the previous reddit cloudsearch API used this date rather than created_utc), but I have now updated all my code to use created_utc everywhere.
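A quick way to check the offset between the two fields in your own data is to subtract them and convert `created_utc` to a timezone-aware datetime. This is only a sketch: the timestamps below are hypothetical, with `created` deliberately set 13 hours ahead to mirror the observation above, and the helper names are made up for illustration.

```python
from datetime import datetime, timezone


def offset_hours(created: float, created_utc: float) -> float:
    """Difference between the two epoch timestamps, in hours."""
    return (created - created_utc) / 3600.0


def to_utc_datetime(created_utc: float) -> datetime:
    """Interpret created_utc as seconds since the Unix epoch, in UTC."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc)


# Hypothetical values, not taken from a real API response.
created_utc = 1509235200
created = created_utc + 13 * 3600

print(offset_hours(created, created_utc))
print(to_utc_datetime(created_utc).isoformat())
```

Passing `tz=timezone.utc` keeps the result independent of the machine's local timezone, which is the usual reason to standardise on `created_utc`.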

I've now implemented a markdown parser on the server side, as I use the html for various reasons in the back end as well as the front end. This works after a fashion, but it doesn't give exactly the same results as I had previously and still needs some work. It seems like unnecessary overhead to implement a parser in the back or front end, plus the extra processing time, when the parsed html was already there to begin with.

I can understand you wanting to keep your responses shorter where possible though. If you retain the selftext_html (and other missing fields) in your database would it make sense to return them if specifically requested through the fields parameter?
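The request being proposed here would look something like the following. This sketch only builds the URL and sends nothing over the network. The endpoint URL and the `subreddit` and `size` parameters are assumptions chosen for illustration, and whether the server would return `selftext_html` when asked is exactly the open question in this thread.

```python
from urllib.parse import urlencode

# Assumption: the submission search endpoint; check the API README for the current URL.
BASE_URL = "https://api.pushshift.io/reddit/search/submission/"


def build_search_url(fields, **params):
    """Build a search URL that asks for specific fields via the `fields` parameter."""
    query = dict(params)
    query["fields"] = ",".join(fields)  # comma-separated list of field names
    return BASE_URL + "?" + urlencode(query)


url = build_search_url(
    ["id", "created_utc", "selftext_html"],
    subreddit="askscience",
    size=5,
)
print(url)
```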

pushshift commented 7 years ago

Do you mean reconstruct the selftext_html programmatically within the API? That's probably something worth doing -- I just need a Python version of the markup library. If you can help me locate one, I'll add it quickly.
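One possible shape for this, sketched under stated assumptions rather than taken from the actual Pushshift code: render `selftext_html` only when a caller asks for it through `fields`, and cache the result. The renderer is injected as a plain callable so that a real Markdown library could be plugged in (Reddit's own snudown has Python bindings and may be a candidate). `placeholder_render` below is a stand-in that only escapes the text; it is not a Markdown parser. All names in the block are hypothetical.

```python
from functools import lru_cache
from html import escape
from typing import Any, Callable, Dict, Iterable


def placeholder_render(text: str) -> str:
    """Stand-in renderer: escapes the text and wraps it in one <p>. NOT Markdown."""
    return "<p>" + escape(text) + "</p>"


def make_projector(render: Callable[[str], str]) -> Callable[..., Dict[str, Any]]:
    """Return a function that projects a stored document onto the requested fields."""
    cached_render = lru_cache(maxsize=1024)(render)  # avoid re-rendering identical text

    def project(doc: Dict[str, Any], fields: Iterable[str]) -> Dict[str, Any]:
        out: Dict[str, Any] = {}
        for name in fields:
            if name == "selftext_html":
                # Derived on demand from selftext instead of being stored.
                out[name] = cached_render(doc.get("selftext", ""))
            elif name in doc:
                out[name] = doc[name]
        return out

    return project


project = make_projector(placeholder_render)
doc = {"id": "abc123", "selftext": "a < b"}  # hypothetical stored record
result = project(doc, ["id", "selftext_html"])
```

Requests that do not name `selftext_html` never invoke the renderer, so default responses stay as short as they are today.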

Thanks!
