semanticClimate / city-open-climate-reader

City - Open Climate Reader: A proof-of-concept prototype for a semanticClimate publication built on a Quarto / Jupyter Notebook model for computational publishing
https://semanticclimate.github.io/city-open-climate-reader/
MIT License

Run HTML to Markdown conversion #1

Open mrchristian opened 1 year ago

mrchristian commented 1 year ago

https://github.com/semanticClimate/city-climate-plans-notebook

mrchristian commented 1 year ago

@Mahvish will run the HTML to Markdown Python script once she has joined the semanticClimate GitHub organisation. The repository with the script is linked above, and the script can be found in the python directory. I will pass along instructions for use from Simon Bowie.

mrchristian commented 1 year ago

Quote from Simon Bowie:

I wrote a relatively basic Python script (https://github.com/SimonXIX/quarto_semanticclimate/blob/main/python/quarto_markdown.py) which takes HTML as input, processes the HTML, converts it to Markdown, and then processes the Markdown before outputting to a .qmd file.

The HTML generated from the IPCC report was fairly messy so various bits of processing were required on the HTML to convert the styles in the header to HTML styles that the Markdown converter could understand. The processing on the Markdown is to tidy it up and make the converted text look better in Markdown format.

At the most basic level, the HTML to Markdown conversion is done using the markdownify module: https://pypi.org/project/markdownify/

This is run using python3 ./python/quarto_markdown.py.

The input I used was https://github.com/petermr/semanticClimate/blob/main/ipcc/ar6/syr/lr/html/fulltext/groups_groups.html and the output can be seen through Quarto rendering at https://simonxix.github.io/quarto_semanticclimate/groups_groups.html

End quote
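The pipeline Simon describes (HTML in, Markdown conversion, Markdown tidy-up, .qmd out) can be sketched roughly as below. This is only an illustration, not the contents of quarto_markdown.py: the function names `tidy_markdown` and `html_to_qmd`, the `convert` parameter, and the specific tidy-up rules are all assumptions made for the example. By default it calls the `markdownify` function from the markdownify package, which has to be installed separately.

```python
import re
from pathlib import Path


def tidy_markdown(markdown: str) -> str:
    """Illustrative clean-up (assumed rules, not the real script's):
    strip trailing spaces and collapse runs of blank lines."""
    lines = [line.rstrip() for line in markdown.splitlines()]
    text = "\n".join(lines)
    # Three or more newlines in a row become a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip() + "\n"


def html_to_qmd(html: str, out_path: Path, convert=None) -> str:
    """Convert an HTML string to Markdown, tidy it, and write a .qmd file.

    `convert` is a hypothetical hook so another converter can be swapped in;
    when omitted, the third-party markdownify package is used.
    """
    if convert is None:
        from markdownify import markdownify as convert  # pip install markdownify
    markdown = tidy_markdown(convert(html))
    out_path.write_text(markdown, encoding="utf-8")
    return markdown
```

A call such as `html_to_qmd(Path("input.html").read_text(encoding="utf-8"), Path("output.qmd"))` would then produce a file that Quarto can render; both file names here are placeholders.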

mrchristian commented 1 year ago

Hi @06maHi, do you want to have a go at running the Python script Simon Bowie has added here? You can fork this repo if you like. Please feel free to ask any questions or reach out for help if you need it. You don't need to run Quarto as well, but you can if you like - info here: https://nfdi4culture.github.io/FSCI-Class-Publishing-from-Collections/#_5_0

06maHi commented 1 year ago

Hello @mrchristian, I have installed the markdownify package to convert HTML into Markdown. Let me know if this is the correct one. Also, I will need some help using it.

mrchristian commented 1 year ago

The first step would just be to see if the script runs in a fork of your own.

To run the HTML to Markdown conversion you only need to run the script in the /python directory. You don't need to run Quarto, although you can if you like.

The requirements need installing first - https://github.com/semanticClimate/city-climate-plans-notebook/blob/main/python/requirements.txt

script - https://github.com/semanticClimate/city-climate-plans-notebook/tree/main/python

mrchristian commented 1 year ago

I think we'd have to change the local paths in the script, and also check whether the HTML from PMR gets copied across first.

https://github.com/semanticClimate/city-climate-plans-notebook/blob/0f03a1171e0805ad915aa446a40835aaa2f5a3eb/python/quarto_markdown.py#L7
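One way to avoid editing local paths by hand on each machine is to resolve them relative to the script file, and to fail early if the input HTML has not been copied across. The sketch below is a suggestion only, not what quarto_markdown.py currently does: `resolve_paths` and its arguments are hypothetical, and it assumes the script sits in a python directory one level below the repository root.

```python
from pathlib import Path


def resolve_paths(script_file: str, input_rel: str, output_rel: str):
    """Build input/output paths relative to the repository root.

    Assumes (hypothetically) that the script lives in <repo>/python/,
    so the repository root is two levels up from the script file.
    """
    repo_root = Path(script_file).resolve().parent.parent
    input_path = repo_root / input_rel
    output_path = repo_root / output_rel
    # Stop with a clear message if the HTML was never copied into the repo.
    if not input_path.is_file():
        raise FileNotFoundError(f"Input HTML not found: {input_path}")
    return input_path, output_path
```

Inside the script this could be called as `resolve_paths(__file__, "html/input.html", "output.qmd")`, where both relative paths are placeholders to be replaced with the real locations.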

mrchristian commented 1 year ago

A few questions: Did you edit the local paths to get the script to work? Are you running it on a clone or a fork of the repository? If you're up for it, I would suggest trying to install the Quarto framework. Instructions are in the README.md https://github.com/semanticClimate/city-climate-plans-notebook/blob/main/README.md and you can read more about installation and use here: https://nfdi4culture.github.io/FSCI-Class-Publishing-from-Collections/

06maHi commented 1 year ago

Yes, I have edited the local path and also used a forked repo. I will go for the Quarto installation once I am back at work.