openzim / sotoki

StackExchange websites to ZIM scraper
https://library.kiwix.org/?category=stack_exchange
GNU General Public License v3.0

Revert useless changes in 2.1.1 #313

Closed by benoit74 2 months ago

benoit74 commented 2 months ago

The changes made in 2.1.1 were not necessary; what they worked around were only "mistakes" from SE: https://meta.stackexchange.com/a/398606

We have to revert at least the part doing the conversion from UTF-16, plus the adaptation of the XML "parsing" (spaces at the beginning / end, CRLF vs CR, ...)

I suggest keeping the part that removes deleted posts / comments: even though these are also not supposed to be present in the data dump, I do not expect this check to cause much trouble, and they might be present in the data dump in the future.
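
For readers unfamiliar with the workaround, the kind of normalization to be reverted can be sketched roughly as below. This is only an illustration, not the actual 2.1.1 code: the helper name `normalize_dump_bytes` is hypothetical, and detecting UTF-16 via its byte-order mark is an assumption.

```python
import codecs


def normalize_dump_bytes(raw: bytes) -> bytes:
    """Hypothetical helper: return UTF-8 bytes with LF endings and trimmed lines."""
    # Assumption: a UTF-16 dump starts with a byte-order mark.
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        text = raw.decode("utf-16")  # the codec consumes the BOM
    else:
        text = raw.decode("utf-8-sig")  # tolerates an optional UTF-8 BOM
    # Normalize CRLF and lone CR to LF before splitting into lines.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Drop spaces at the beginning / end of each line.
    lines = [line.strip() for line in text.split("\n")]
    return "\n".join(lines).encode("utf-8")
```

With a well-formed UTF-8 dump that already uses LF line endings and has no padding, this step returns its input unchanged, which is why it can be dropped.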

rgaudin commented 2 months ago

Makes sense