kikass13 / awesome2py

This tool can convert awesome lists to python data-sets using beautiful-soup and markdown

Awesome2py does not swallow Open Sustainable Technology completely #1

Closed Ly0n closed 2 years ago

Ly0n commented 2 years ago

Hey @kikass13,

I wonder if you know why awesome2py does not take the complete list of OpenSustain.tech. It always stops at the Renewcast project but I do not understand why. The list does not differ here in any way. Maybe it is a buffer that is running out?

Kind regards Tobias

kikass13 commented 2 years ago

@Ly0n I tested it ...

It does print all of the 360 entries ... only the order is not preserved (because it is not necessary).

Which means that the last thing you see in the output is not the last entry of the list.

or am I wrong here?

Ly0n commented 2 years ago

The list has more than 1000 entries :).

kikass13 commented 2 years ago

yeah you are right.

Renewable Energy [49]
Energy Storage [33]
Energy Demand and Efficiency [10]
Energy Systems [8]
Datasets on Energy Systems [3]
Emissions [23]
Ecological Footprint [8]
Biosphere [48]
Natural Resources [68]
Agriculture and Nutrition [4]
Sustainable Investment [19]
Further Open and Sustainable Resources [87]

is what I get ... some (nearly all?) of the sub-entries are not parsed properly ... I'll take a look
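The per-category counts in that output add up to the 360 reported earlier, which is well short of the 1000+ entries in the list. A small sketch for checking such a summary is below; the `total_entries` helper is hypothetical and not part of awesome2py, and it only assumes the `Name [count]` line format shown in this thread.

```python
import re

# Matches lines such as "Renewable Energy [49]" (format taken from the output above).
COUNT = re.compile(r"^(?P<name>.+?)\s*\[(?P<count>\d+)\]\s*$")


def total_entries(summary: str) -> int:
    """Sum the bracketed per-category counts in a printed summary."""
    total = 0
    for line in summary.splitlines():
        match = COUNT.match(line.strip())
        if match:  # skip separators and blank lines
            total += int(match["count"])
    return total


summary = """\
Renewable Energy [49]
Energy Storage [33]
Energy Demand and Efficiency [10]
Energy Systems [8]
Datasets on Energy Systems [3]
Emissions [23]
Ecological Footprint [8]
Biosphere [48]
Natural Resources [68]
Agriculture and Nutrition [4]
Sustainable Investment [19]
Further Open and Sustainable Resources [87]
"""

print(total_entries(summary))  # 360
```

Comparing that total with a rough count of the list items in the README makes missing sub-lists easy to spot.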

kikass13 commented 2 years ago

well as it seems, sub lists are currently not supported.

I will try to hotfix this issue

Ly0n commented 2 years ago

> well as it seems, sub lists are currently not supported.
>
> I will try to hotfix this issue

Thank you so much. This will help us a lot. @rockita and @tjarkdoering

kikass13 commented 2 years ago

I remember why this is ... the awesome list format is awkward, and the contents table is not really parsable in the rendered .html ...

<h2>Contents</h2>
<ul>
<li><a href="#renewable-energy">Renewable Energy</a></li>
<li><a href="#photovoltaics-and-solar-energy">Photovoltaics and Solar Energy</a></li>
<li><a href="#wind-turbines">Wind Turbines</a></li>
<li><a href="#hydro-energy">Hydro Energy</a></li>
<li><a href="#geothermal-energy">Geothermal Energy</a></li>
<li><a href="#bioenergy">Bioenergy</a></li>
<li><a href="#energy-storage">Energy Storage</a></li>
<li><a href="#battery">Battery</a></li>
<li><a href="#hydrogen">Hydrogen</a></li>
<li><a href="#energy-demand-and-efficiency">Energy Demand and Efficiency</a></li>
<li><a href="#buildings-and-cities">Buildings and Cities</a></li>
<li><a href="#mobility-and-transportation">Mobility and Transportation</a></li>
<li><a href="#production-and-industry">Production and Industry</a></li>
<li><a href="#computation-and-services">Computation and Services</a></li>
<li><a href="#energy-systems">Energy Systems</a></li>
<li><a href="#modeling-and-optimization">Modeling and Optimization</a></li>
<li><a href="#monitoring-and-control">Monitoring and Control</a></li>
<li><a href="#energy-distribution-and-grids">Energy Distribution and Grids</a></li>
<li><a href="#datasets-on-energy-systems">Datasets on Energy Systems</a></li>
<li><a href="#emissions">Emissions</a></li>
<li><a href="#carbon-intensity">Carbon Intensity</a></li>
<li><a href="#carbon-capture-and-removel">Carbon Capture and Removel</a></li>
<li><a href="#emission-observation-and-modeling">Emission Observation and Modeling</a></li>
<li><a href="#ecological-footprint">Ecological Footprint</a></li>
<li><a href="#life-cycle-assessment">Life Cycle Assessment</a></li>
<li><a href="#circular-economy-and-waste">Circular Economy and Waste</a></li>
<li><a href="#biosphere">Biosphere</a></li>
<li><a href="#life-forms-and-biodiversity">Life Forms and Biodiversity</a></li>
<li><a href="#ice-and-poles">Ice and Poles</a></li>
<li><a href="#salt-and-fresh-water">Salt and Fresh Water</a></li>
<li><a href="#atmosphere">Atmosphere</a></li>
<li><a href="#climate-and-earth-modeling">Climate and Earth Modeling</a></li>
<li><a href="#earth-climate-datasets-and-tools">Earth Climate Datasets and Tools</a></li>
<li><a href="#natural-resources">Natural Resources</a></li>
<li><a href="#air">Air</a></li>
<li><a href="#water">Water</a></li>
<li><a href="#soil-and-land">Soil and Land</a></li>
<li><a href="#agriculture-and-nutrition">Agriculture and Nutrition</a></li>
<li><a href="#sustainable-investment">Sustainable Investment</a></li>
<li><a href="#further-open-and-sustainable-resources">Further Open and Sustainable Resources</a></li>
</ul>

As a matter of fact, I can implement it so that the subgroups are added ... but this would result in all entries being doubled. Because I use the contents field as a lookup, and the contents do not represent sub-categories properly, my algorithm will duplicate entries or fail.
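To make the flattening concrete, here is a minimal sketch that records the `<ul>` nesting depth of every contents link. It uses the standard library's `html.parser` rather than BeautifulSoup, and the `TocParser` class is hypothetical (it is not awesome2py code). The input is a three-line excerpt of the HTML above.

```python
from html.parser import HTMLParser


class TocParser(HTMLParser):
    """Collect (depth, href) pairs for every link found inside <ul> lists."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # current <ul> nesting level
        self.entries = []   # (depth, href) in document order

    def handle_starttag(self, tag, attrs):
        if tag == "ul":
            self.depth += 1
        elif tag == "a" and self.depth > 0:
            self.entries.append((self.depth, dict(attrs).get("href")))

    def handle_endtag(self, tag):
        if tag == "ul":
            self.depth -= 1


# Excerpt of the rendered contents table: a main section, one of its
# sub-sections, and the next main section.
flat_html = """<ul>
<li><a href="#renewable-energy">Renewable Energy</a></li>
<li><a href="#photovoltaics-and-solar-energy">Photovoltaics and Solar Energy</a></li>
<li><a href="#energy-storage">Energy Storage</a></li>
</ul>"""

parser = TocParser()
parser.feed(flat_html)
depths = {depth for depth, _ in parser.entries}
print(parser.entries)
print(depths)  # {1}: main sections and sub-sections are indistinguishable
```

Since every link sits at depth 1, nothing in the HTML says that "Photovoltaics and Solar Energy" belongs under "Renewable Energy".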

Ly0n commented 2 years ago

@kikass13 I added a marker a while ago to cut out the contents area when creating the OpenSustain.tech website. Maybe this will also help as a workaround.

kikass13 commented 2 years ago

Well, that won't work: I use / need the contents table, and it is a useful thing to have. I am currently trying to add something magical, gimme a sec.

kikass13 commented 2 years ago

It's a bit hacky now ... but it should work :)

===============================================
Photovoltaics and Solar Energy [49]
Wind Turbines [33]
Hydro Energy [10]
Geothermal Energy [8]
Bioenergy [3]
Battery [23]
Hydrogen [8]
Buildings and Cities [48]
Mobility and Transportation [68]
Production and Industry [4]
Computation and Services [19]
Modeling and Optimization [87]
Monitoring and Control [9]
Energy Distribution and Grids [33]
Carbon Intensity [30]
Carbon Capture and Removel [17]
Emission Observation and Modeling [6]
Life Cycle Assessment [24]
Circular Economy and Waste [14]
Life Forms and Biodiversity [8]
Ice and Poles [84]
Salt and Fresh Water [26]
Atmosphere [55]
Climate and Earth Modeling [26]
Earth Climate Datasets and Tools [55]
Air [78]
Water [23]
Soil and Land [62]
===============================================
Done parsing '910' entries.

I hope this format doesn't get worse ... this sub-listing is quite annoying. Maybe I have to do something in the beautifulsoup package ... the fact that markdown lists are not parsed properly (with respect to indentation) is quite the bug ... :/
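For illustration only: if the contents list in the Markdown source is indented (an assumption, since the thread only shows the flattened HTML), the hierarchy could be read from the indentation directly, before any HTML conversion. The sample list and the `parse_toc` helper below are hypothetical and not part of awesome2py.

```python
import re

# A contents bullet such as "  - [Wind Turbines](#wind-turbines)".
ITEM = re.compile(
    r"^(?P<indent>\s*)[-*]\s+\[(?P<title>[^\]]+)\]\(#(?P<anchor>[^)]+)\)"
)


def parse_toc(markdown_toc: str) -> dict:
    """Map each unindented contents entry to the entries indented below it."""
    tree = {}
    current = None
    for line in markdown_toc.splitlines():
        match = ITEM.match(line)
        if not match:
            continue
        if len(match["indent"]) == 0:
            # Unindented bullet: a main section.
            current = match["title"]
            tree[current] = []
        elif current is not None:
            # Indented bullet: a sub-section of the last main section.
            tree[current].append(match["title"])
    return tree


# Assumed source layout; the real README may be indented differently.
toc = """\
- [Renewable Energy](#renewable-energy)
  - [Wind Turbines](#wind-turbines)
  - [Hydro Energy](#hydro-energy)
- [Agriculture and Nutrition](#agriculture-and-nutrition)
"""

tree = parse_toc(toc)
print(tree)
```

Main sections with an empty list (here "Agriculture and Nutrition") hold their items directly, while the others only group sub-sections.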

kikass13 commented 2 years ago

@Ly0n can you test this and close the issue if it's working correctly?

kikass13 commented 2 years ago

well I can see that it's not ... so NVM.

The problem is, that your list uses [FORMAT1]

 ## some main list
     ### some sub list
         - some items

as well as [FORMAT2]

 ## some main list
   - some items

so the parser CANNOT parse both these semantics at the same time (because the contents are not differentiable by default).
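As a rough sketch of the ambiguity: when working from the Markdown source instead of the contents table, one possible way to handle both layouts is to attach each item to the nearest preceding `##` or `###` heading and drop headings that end up with no items of their own. This is not how awesome2py works; `group_items` and the sample text are hypothetical.

```python
import re

HEADING = re.compile(r"^\s*(#{2,3})\s+(.*\S)\s*$")  # "## ..." or "### ..."
ITEM = re.compile(r"^\s*[-*]\s+(.*\S)\s*$")         # "- ..." or "* ..."


def group_items(markdown: str) -> dict:
    """Attach every list item to the nearest preceding ##/### heading."""
    groups = {}
    current = None
    for line in markdown.splitlines():
        heading = HEADING.match(line)
        if heading:
            current = heading.group(2)
            groups.setdefault(current, [])
            continue
        item = ITEM.match(line)
        if item and current is not None:
            groups[current].append(item.group(1))
    # A main list that only contains sub-lists has no direct items; drop it.
    return {name: items for name, items in groups.items() if items}


# FORMAT1 (main list with a sub-list) followed by FORMAT2 (items directly
# under a main list); the item names are placeholders.
sample = """\
## Renewable Energy
### Wind Turbines
- item A
- item B
## Agriculture and Nutrition
- item C
"""

print(group_items(sample))
```

With this grouping, "Renewable Energy" disappears from the result and its items are counted once under "Wind Turbines", while "Agriculture and Nutrition" keeps its direct item.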

So we have a few options:

Ly0n commented 2 years ago

@kikass13 I think we can do that.

kikass13 commented 2 years ago

I fixed your awesome list for you; the result looks like this (which is in fact the correct amount of parsed data):

===============================================
Photovoltaics and Solar Energy [49]
Wind Turbines [33]
Hydro Energy [10]
Geothermal Energy [8]
Bioenergy [3]
Battery [23]
Hydrogen [8]
Buildings and Cities [48]
Mobility and Transportation [68]
Production and Industry [4]
Computation and Services [19]
Modeling and Optimization [87]
Monitoring and Control [9]
Energy Distribution and Grids [33]
Datasets on Energy Systems [30]
Carbon Intensity [17]
Carbon Capture and Removel [6]
Emission Observation and Modeling [24]
Life Cycle Assessment [14]
Circular Economy and Waste [8]
Life Forms and Biodiversity [84]
Ice and Poles [26]
Salt and Fresh Water [55]
Atmosphere [26]
Climate and Earth Modeling [55]
Earth Climate Datasets and Tools [78]
Air [23]
Water [62]
Soil and Land [37]
Agriculture and Nutrition [38]
Finances [10]
Miscellaneous [49]
===============================================
Done parsing '1044' entries.

Ly0n commented 2 years ago

@kikass13: I just reordered the list like you proposed. It is now also much better structured this way.

kikass13 commented 2 years ago

I hope this solves the problem .. for now :p

Ly0n commented 2 years ago

It works perfectly. Thank you so much @kikass13 :heart:. You can see the current implementation here: https://github.com/protontypes/sustainbeat For now I have copied your project into the sustainbeat folder. It would be good to create a pip package out of it, or we could maintain your code in sustainbeat from now on. What do you think?

kikass13 commented 2 years ago

Not really interested :D you can do that if you want but I like it being something random (because it is) :p

kikass13 commented 2 years ago

I'm gonna close this, since @Ly0n seems happy with my hack :)