Closed AaryanTR closed 3 months ago
@ansukla would you kindly review this PR? Or was this project abandoned?
@rodrigomeireles and @AaryanTR. Sorry for the delay - things have been hectic and I only get nights and weekends to work on this now. If either one of you would like to be a maintainer, please let me know.
@AaryanTR - Did you run the test cases, and would it be possible for you to add a new one to test this change? Since this is a significant change to the two main functions, it would help to have test cases to better understand the behaviour and retain it through future changes.
Hi @ansukla
Thank you for your response. I have run the tests and they pass. I have also added test cases for the `to_text` and `to_html` methods in the latest commit.
Test before adding the new test cases:
Test after adding the new test cases:
> @rodrigomeireles and @AaryanTR. Sorry for the delay - things have been hectic and I only get nights and weekends to work on this now. If either one of you would like to be a maintainer, please let me know.
I would love to be a maintainer, but this is my work account. Is it possible for you to add my personal GitHub account, @aaryan200, as a maintainer?
Thank you
Fixes #79
Here is the current implementation of the `to_text` method:

The issue occurs when the document tree has the following (or similar) structure:
The text of `Section-2` is included in the output when the `to_text` method is called (recursively) for `Section-1` as well as for `Section-2`. Similarly, the text of `Section-3` is also duplicated in the output.

To remove the duplicates, iterate over all the sections in the document and choose the top sections, i.e., those sections that are not children of any other section. At the end, concatenate the (recursive) text of all the top sections.
Here is a summary of the changes made:

- `_get_top_sections` returns all the top sections in the document tree.
- The `to_text` and `to_html` methods accept a boolean parameter `include_duplicates`.
- If `include_duplicates` is `False`, only the top sections are considered.

You can take a look at this fix in Google Colab here.
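The effect of the `include_duplicates` flag can be sketched as follows. This is an illustration only: the `TREE` dictionary, the `section_text` helper, and the default value of `False` are assumptions made for the example, not the library's actual data model or signature.

```python
from typing import Dict, List

# Hypothetical flat representation: each section name maps to its child names.
TREE: Dict[str, List[str]] = {
    "Section-1": ["Section-2"],
    "Section-2": ["Section-3"],
    "Section-3": [],
}


def section_text(name: str) -> List[str]:
    # Recursive text of one section: its own name, then its descendants.
    lines = [name]
    for child in TREE[name]:
        lines.extend(section_text(child))
    return lines


def to_text(include_duplicates: bool = False) -> str:
    # The default of False is an assumption made for this sketch.
    if include_duplicates:
        # Previous behaviour: every section is used as a starting point.
        roots = list(TREE)
    else:
        # Keep only sections that never appear as a child of another section.
        children = {c for kids in TREE.values() for c in kids}
        roots = [name for name in TREE if name not in children]
    return "\n".join(line for root in roots for line in section_text(root))
```

With this tree, `to_text()` yields three lines, one per section, whereas `to_text(include_duplicates=True)` yields six lines because the nested sections are repeated.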