Clean up more the output of `convert_docx_to_markdown`

kaizen-ai / kaizenflow

KaizenFlow is a framework for Bayesian reasoning and AI/ML stream computing

GNU General Public License v3.0

Clean up more the output of `convert_docx_to_markdown` #1068

Open samarth9008 opened 4 months ago

samarth9008 commented 4 months ago

The doc for the flow is here: `docs/documentation_meta/all.gdocs.how_to_guide.md`

The script is here: `dev_scripts/documentation/convert_docx_to_markdown.py`

[x] Convert a Gdoc file into a markdown - https://docs.google.com/document/d/1vg2IipQ4csA18hv041uRjxyF_vcdHrWuhpejf2zhaUs/edit
[ ] Merge it and remove the Gdoc
[x] Add unit tests for the script
[x] Extend the script to handle a few corner cases

Remove - `[<span class="underline">Google Authenticator</span>]`
Remove - `[<u>Google Authenticator</u>]`
`…` (one char) -> `...`

Remove the style `<img src="figs/setup_vpn_and_dev_server_access/image1.png" style="width:5.43272in;height:4.52727in" />`
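The corner cases listed above could be handled with a few regular expressions. The following is a minimal sketch, not the code in `convert_docx_to_markdown.py`: the function name `clean_up_artifacts` and the regex names are hypothetical, and it assumes that "remove" means dropping the underline markup while keeping the link text, and dropping only the `style` attribute while keeping the `<img>` tag.

```python
import re

# Hypothetical patterns, written for illustration only.
# Matches <span class="underline">...</span> or <u>...</u> and captures the inner text.
_UNDERLINE_RE = re.compile(
    r'<span class="underline">(.*?)</span>|<u>(.*?)</u>', flags=re.DOTALL
)
# Matches the part of an <img ...> tag before a style="..." attribute.
_IMG_STYLE_RE = re.compile(r'(<img\b[^>]*?)\s+style="[^"]*"')


def clean_up_artifacts(text: str) -> str:
    """Apply the clean-ups from the issue to a markdown string."""
    # Drop the underline wrappers but keep the wrapped text.
    text = _UNDERLINE_RE.sub(
        lambda m: m.group(1) if m.group(1) is not None else m.group(2), text
    )
    # Replace the single-character ellipsis with three dots.
    text = text.replace("\u2026", "...")
    # Strip the style attribute from <img> tags, keeping the rest of the tag.
    text = _IMG_STYLE_RE.sub(r"\1", text)
    return text
```

Keeping each clean-up as a separate substitution would also make it easy to split the work across small PRs, one rule and one test per PR.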

I would focus on one thing at a time. Start with something easy and split the work across multiple PRs to make the review easier.

FYI @gpsaggese @sonaalKant

mayank922 commented 4 months ago

Hi @samarth9008

Could you please provide access to the Gdoc file mentioned above?

samarth9008 commented 4 months ago

@surbhi498 and @mayank922

Given the complexity of the task, can you please coordinate and work on it together? It might take too long for one person to work on it. Consider it one of the teamwork test tasks.

FYI @DanilYachmenev @gpsaggese @sonaalKant

surbhi498 commented 4 months ago

Hi @samarth9008

Could you please provide access to the Gdoc file mentioned above?

gpsaggese commented 4 months ago

Shared.

mayank922 commented 4 months ago

Hi @samarth9008

I am unable to find `dev_scripts/lint_md.sh`. Has the file been renamed, or is there another way to run the linter on md files?

samarth9008 commented 4 months ago

To run the linter, follow this guide:

https://github.com/kaizen-ai/kaizenflow/blob/master/docs/coding/all.submit_code_for_review.how_to_guide.md#run-linter

mayank922 commented 4 months ago

Hi @samarth9008

I followed the one in the docs for my previous PR, but this doc, `docs/documentation_meta/all.gdocs.how_to_guide.md`, describes it incorrectly.

We could correct this line: https://github.com/kaizen-ai/kaizenflow/blob/5ddfaadcdb332e39a8344741add21e32a7bb702d/docs/documentation_meta/all.writing_docs.how_to_guide.md?plain=1#L348

samarth9008 commented 4 months ago

This was the old way of running the linter specifically on md files. We recently merged it with our Python linter, so md files can now be linted using `i lint --files ....`

Feel free to create a PR for it.

mayank922 commented 4 months ago

Hi @samarth9008

The second task would be done by you, right?

samarth9008 commented 4 months ago

Yes

surbhi498 commented 4 months ago

Hi @samarth9008, when we run the test cases against the function, they fail because of the interactive mode used by the function in the script. As a workaround we run the test cases locally with `pytest -s -v`. Is there another approach that would not fail once we create a PR for this?

samarth9008 commented 4 months ago

Instead of testing the script, can we test only the functions used in the script?

mayank922 commented 4 months ago

Do you mean that we just test the rest of the functionality of this function and don't test the docker command, which uses the interactive mode?

docker_cmd = f"docker run --rm --user $(id -u):$(id -g) -it --workdir {work_dir} --mount {mount} {docker_container_name} {convert_docx_to_markdown_cmd}"
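One way to test the function without running docker is to separate building the command string from executing it, and to make the `-it` flags optional. The following is only a sketch under that assumption: `build_docker_cmd` and its `interactive` parameter are hypothetical names, not part of the existing script.

```python
def build_docker_cmd(
    work_dir: str,
    mount: str,
    container_name: str,
    inner_cmd: str,
    *,
    interactive: bool = True,
) -> str:
    """Build the `docker run` command string without executing it.

    Hypothetical helper: with `interactive=True` it produces the same
    layout as the `docker_cmd` f-string quoted above.
    """
    flags = ["--rm", "--user $(id -u):$(id -g)"]
    if interactive:
        # `-it` needs a TTY, which is typically missing when pytest captures output.
        flags.append("-it")
    flags.append(f"--workdir {work_dir}")
    flags.append(f"--mount {mount}")
    return f"docker run {' '.join(flags)} {container_name} {inner_cmd}"
```

A unit test can then assert on the returned string only, so neither docker nor a terminal is needed; whether the script should drop `-it` when run from tests is a design decision for the maintainers.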

mayank922 commented 4 months ago

Hi @samarth9008

How do you want us to test the `_move_media()` function?

We can write a test case for when no media directory is found: the function produces no output and only logs the fact.

To test the case where the directory exists, do you want us to create a test directory?

samarth9008 commented 4 months ago

For now, let's do one thing.

Try to find test cases similar to this one, i.e. test cases about scripts, and see how they are tested.

Try to look around the code to understand how a temp dir can be created. Looking through other code can give you more ideas. If a function seems complex or you have no idea how to move forward, let's leave a TODO and we will address it separately in a different issue.
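As an illustration of the temp-dir idea, here is a minimal sketch that uses only the standard library. The real signature of `_move_media` is not shown in this thread, so `move_media` below is a hypothetical stand-in that moves a `media` directory to a target directory; the repo may also provide its own test helpers for scratch directories, which this sketch does not use.

```python
import logging
import os
import shutil
import tempfile

_LOG = logging.getLogger(__name__)


def move_media(work_dir: str, figs_dir: str) -> bool:
    """Hypothetical stand-in for `_move_media`: move `work_dir/media` to `figs_dir`."""
    media_dir = os.path.join(work_dir, "media")
    if not os.path.isdir(media_dir):
        # Nothing to move: only log it, as discussed above.
        _LOG.info("No media directory found in '%s'", work_dir)
        return False
    shutil.move(media_dir, figs_dir)
    return True


def check_move_media() -> None:
    """Exercise both cases inside a temp dir that is deleted afterwards."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        figs_dir = os.path.join(tmp_dir, "figs")
        # Case 1: no media directory exists, so nothing is created.
        assert move_media(tmp_dir, figs_dir) is False
        assert not os.path.exists(figs_dir)
        # Case 2: create a media directory with one fake image, then move it.
        media_dir = os.path.join(tmp_dir, "media")
        os.makedirs(media_dir)
        with open(os.path.join(media_dir, "image1.png"), "wb") as f:
            f.write(b"fake image bytes")
        assert move_media(tmp_dir, figs_dir) is True
        assert os.path.isfile(os.path.join(figs_dir, "image1.png"))
        assert not os.path.exists(media_dir)
```

Because everything lives under `tempfile.TemporaryDirectory()`, the test leaves no files behind and needs no fixture directory checked into the repo.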

mayank922 commented 4 months ago

I think we can leave a TODO for the `_convert_docx_to_markdown` function.

We will try again to work on the `_move_media()` function and start working on the `_clean_up_artifacts` function.

surbhi498 commented 3 months ago

I have raised a PR for the script. The link is below: https://github.com/kaizen-ai/kaizenflow/pull/1092#issue-2424171114

surbhi498 commented 3 months ago

Hi @samarth9008, I have raised a PR for the script enhancement. The link is enclosed here: PR_For_Script_Enhancement