Open shawwn opened 4 years ago
Specifically this line: https://github.com/soskek/bookcorpus/blob/05a3f227d9748c2ee7ccaf93819d0e0236b6f424/epub2txt.py#L149
When I tried to convert a book on TensorFlow to text using this script, I noticed chapter 1 was repeated multiple times in the output.
The reason is that the Table of Contents looks similar to this:
ch1.html#section1
ch1.html#section2
ch1.html#section3
...
ch2.html#section1
ch2.html#section2
...
The epub2txt script iterates over this table of contents, strips the fragment from "ch1.html#section1" to get "ch1.html", then converts that file to text. It then repeats this for "ch1.html#section2", which converts the same chapter into text again, once per section entry.
I have a fixed version here: https://github.com/shawwn/scrap/blob/afb699ee9c8181b3728b81fc410a31b66311f0d8/epub2txt#L158-L206
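For readers who don't want to follow the link, the general idea can be sketched as below. This is not the code from the linked fix, only a minimal illustration of deduplicating table-of-contents entries by file. The function name `unique_documents` and the sample `toc` list are hypothetical, and it assumes the table of contents is available as a plain list of href strings.

```python
from urllib.parse import urldefrag


def unique_documents(toc_hrefs):
    """Return each file referenced by the TOC once, in first-seen order.

    Hypothetical helper: assumes `toc_hrefs` is a list of strings such as
    "ch1.html#section1".
    """
    seen = set()
    ordered = []
    for href in toc_hrefs:
        # Drop the "#section" fragment so every entry maps to its file.
        path, _fragment = urldefrag(href)
        # Skip fragment-only entries (empty path) and files already queued.
        if path and path not in seen:
            seen.add(path)
            ordered.append(path)
    return ordered


toc = [
    "ch1.html#section1",
    "ch1.html#section2",
    "ch1.html#section3",
    "ch2.html#section1",
    "ch2.html#section2",
]
print(unique_documents(toc))
```

With this, each chapter file would be converted once, in reading order, however many section anchors point into it.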
Thank you! I'll fix it!