pypsa-meets-earth / pypsa-earth

PyPSA-Earth: A flexible Python-based open optimisation model to study energy system futures around the world.
https://pypsa-earth.readthedocs.io/en/latest/
207 stars 167 forks source link

Issue 501 os to pathlib #1017

Closed finozzifa closed 3 weeks ago

finozzifa commented 2 months ago

Closes # (if applicable).

Issue #501

Changes proposed in this Pull Request

The PR aims at replacing the os module with the pathlib module

Checklist

finozzifa commented 2 months ago

mmm tests were ok when I executed them locally. I need to figure out what's the matter. Apologies for the failing pipelines.

finozzifa commented 2 months ago

Hi all, just as a side-note, every change has been tested locally in ad-hoc python script, in order to make sure that we get with the pathlib module what we got with the os module.

Moreover, I started on May 3rd to replace the code snippets with pathlib that repeat themselves often, with methods in _helpers.py. I unit test these methods, following this procedure: 1) I write the method using the os module code snippet present on the main branch 2) I build the unit test 3) I modify the method, replacing the os module with the corresponding pathlib code snippet 4) I make sure that the unit test still works

ekatef commented 1 month ago

Really a PR that goes through everything. Just added a comment on 1) abstraction consistency and a 2) tests that makes sure that the os to pathlib swap is faultless. On another note, I am not sure if the abstraction layer is super helpful as the pathlib library is quite readable.

This is just a comment. Davide/Katia have for sure more perspectives on this.

Thanks a lot for reviewing, @pz-max!

And thank you @finozzifa for the great work :) As discussed, our "good first issues" are quite labor-intensive, but you manage it graciously. Also, thanks a lot for the PRs you have completed along the way! From our side, I fear that reviewing can take some time. Apologies for that, and please keep in mind that it's not your fault in any way.

Regarding testing, @finozzifa has mentioned that every change has been tested locally to ensure that the workflow output remained the same. [Windows CI currently crashes due to troubles with GADM server: Exception: GADM server is down at https://geodata.ucdavis.edu/gadm/gadm4.1/gpkg/gadm41_NGA.gpkg. Data needed for building shapes can't be extracted.] But I agree that it would be really handy to have some testing output to be sure the introduced changes work, even for some edge cases. @finozzifa do you see any possibility to share the testing infrastructure you have developed? Like creating and pushing a dedicated branch for testing? That is not your word in which I have doubts, rather our ability to detect potential troubles with just an intense look into the code... 🙃

Regarding the consistency, I wonder if a string representation of Path object does make sense (e.g. str(pathlib.Path.cwd())), as it basically gives the result as os methods. [The idea taken from @davide-f :) ] May it be a better approach to revert the changes, if such a transformation is needed? That would also allow to keep the PR lighter...

@finozzifa @davide-f @pz-max what are your feeling about all that? 🙂

finozzifa commented 1 month ago

Hi all,

thanks for your interest in my PR. I really enjoyed working on it :)

-) @pz-max: I created separate methods that enclose pathlib invocations to enable an easier unit testing and to make the code more maintainable. Now, if we need to change one line in (say) method _getpath, we just need to change one file, rather than the entire scripts folder. Finally, I believe that having this abstraction layer makes the code review easier.

-) @pz-max and @ekatef: I have created dedicated unit tests, in the _testhelpers.py file

-)@pz-max and @ekatef: The unit tests have been created as follows: 1) I create the method with the os module , 2) I build the unit test, 3) I re-write the method with pathlib, 4) I make sure that the corresponding unit test still works.

-) Indeed @pz-max , I have not created methods for all invocations of pathlib. My goal would be do that and so having the import pathlib just in the _helpers.py file. I am happy to do it in this issue, or create a new one and work on it after this PR is merged. Let me know what you guys prefer.

Next steps: -) I will investigate further why the pipeline failed (thanks @ekatef) --> the windows pipeline is still failing because this website is not reachable https://geodata.ucdavis.edu/gadm/gadm4.1/gpkg/gadm41_BEN.gpkg --> it now works. Thanks @ekatef for restarting the pipeline -) I will modify the unit tests, adding a further assert statement where I compare the output with the corresponding os invocation I replaced ----> this is done!

finozzifa commented 1 month ago

Hi @FabianHofmann thanks for your comment :)

I think that I just follow a different coding style. Whenever I see a line of code which repeats itself several times, I try (where possible) to create methods that implement such line of code and that can be separately tested.

This is because I want to ensure consistency and make future changes easier. In fact, instead of tracing throughout the code each invocation of (say) os.path.basename(), I just need to modify once the method where such line is implemented.

I am not sure why code readability would suffer following this approach, given the fact that people would just need to trace the method in the _helpers.py to check what it does.

I believe that in this way, code reviews, code merges and further code expansions will be easier, as we can rely on soundproof methods that we have unit-tested.

FabianHofmann commented 1 month ago

I think that is super reasonable. It makes most sense to avoid code duplication. Just felt that some of the functions are not "abstract" enough. Perhaps a common ground is trying to avoid the back and forth between Path and string and trying to stick to Path wherever possible? Btw, f-strings nicely process Path objects, like

p  = Path("path/to/file")
print(f"This is the path: {p}")

But as said, you have the full power :)

finozzifa commented 1 month ago

Hi @FabianHofmann,

thanks again for your comment and your contribution :) (which I really appreciate)

I added the casting to string of the PosixPath object because otherwise some parts of code are failing.

Leveraging on the possibilities offered by the abstract method _getpath, I removed such casting and restarted the pipelines. As you see they are unfortunately failing. One error that I have found is:

_ERROR:_helpers:An error happened in module '/home/runner/work/pypsa-earth/pypsa-earth/scripts/build_naturaraster.py', function '': argument of type 'PosixPath' is not iterable

It seems therefore that in some parts of the code, the casting is indeed needed.

I have created also the _getposixpath method because, based on my tests, the PosixPath object is needed in the _build_testconfigs.py script

How would you proceed in this case?

FabianHofmann commented 1 month ago

Sorry for being late here. I am wondering where this happens. Could you point me to the exact spot? The string object is iterable but in case of a path it would not make sense ("path/to" would be divided into "p", "a", "t", ...).

finozzifa commented 1 month ago

Hi @FabianHofmann, thanks a lot :) I believe that my previous comment was a bit misleading. I just wanted to point out that apparently there are some parts of the code where PosixPath objects cannot be used and so their string casting is necessary. Anyway, I once again remove the string casting from _getpath and I am letting the pipelines fail again so that I can perform further investigations.

finozzifa commented 1 month ago

Hi @FabianHofmann , now get_path returns a posixPath object. I solved the occurrences where the casting to string was needed. Thanks :)

finozzifa commented 1 month ago

Hi all,

please find below a recap of the changes I performed in this PR.

I tried my best to include all of your suggestions and recommendations. If I missed some, I apologize and I will include them as soon as possible.

### Recap of the changes

1) The os module has been replaced with the pathlib module. There are still a few os-module-calls in _build_naturaraster.py (this is because there is no pathlib alternative for os.walk) and in the _helpers.py

2) As mentioned before, I have introduced an abstraction of the path methods. Such methods are available in the _helpers.py script. In fact, whenever I see lines of code that repeat themselfs several times, I try to create methods that implement such lines of code and that can be separately unit-tested. In this was I can ensure consistency and make future changes easier. Instead of tracing throughout the code each invocation of (say) os.path.basename(), I just need to modify once the method where such line is implemented. I believe that in this way, code reviews, code merges and further code expansions will be easier, as we can rely on soundproof methods that we have unit-tested.

3) I have introduced dedicated unit tests for the path methods I coded in the _helpers.py. As of now, these unit tests are NOT executed as part of the CI/CD pipelines.

### Possible spin-offs

1) It would nice to push these modifications a bit further, so that the dependency on the os and pathlib modules is explicit just in the _helpers.py script.

2) It would nice to include the execution of the unit tests in the CI/CD pipeline, as already done for the framework repository

I thank you in advance for your comments in reviewing this PR :)

Fabrizio