thomjur / PyCollocation

Python module to do simple collocation analysis of a corpus.
GNU General Public License v3.0

Unit tests #2

Open trutzig89182 opened 2 years ago

trutzig89182 commented 2 years ago

I have started drafting some basic unit tests, if you want to have a look: https://github.com/trutzig89182/PyCollocation/commit/84fbfe29d0d09c29af6540f9febf730336f7ae88

I'm not really good at testing, so if you want to go a completely different way, that's fine with me, too.

thomjur commented 2 years ago

That looks great to me! Do you want to add it to the repo? I am not sure if we already need branches at this point. To be honest, my only experience with unit testing is with Java (JUnit), so I am not sure how/where one would implement them in Python (I think there are also packages for that). Maybe you know more, but I think the test that you developed is perfect for the moment. This is generally something I need to read more about since I've been using Python only in short scripts or in Jupyter Notebooks until now.

trutzig89182 commented 2 years ago

I'm a complete noob at testing, but I thought this project could be a good starting point to get a minimal understanding of how it is done. There are packages for unit testing in Python, and as I understand it, they are basically modeled on unit testing in Java.

I agree that branches are not necessary at this point. I use the fork as a way to discuss changes before making a pull request from time to time, especially when I am not sure about what I'm doing. :)

I have not included a test for the full_count variables yet, but will try to add one some time soon. I also wondered whether the "word_counter", which counts the number of times the keyword appears in the text, is something we want to be able to access outside of the collocation() function. I wanted to include it in the test, but then I saw that you had replaced it with the full_counter in the return statement.

thomjur commented 2 years ago

Agreed, it is always good to learn something new. :-) Sure, please add/change whatever you see as necessary. If either of us doesn't agree with the other's changes, we can discuss them or simply propose a different version. If there are big changes that we would like to discuss first, it might also be a good idea to create a pull request and leave it open for discussion.

thomjur commented 2 years ago

I have fixed an issue with the unit test (the assertion was 2 'dolores' in the second test, but there is only one to the right of the search term). Or was this intended to fail? In that case, we should use a negative assertion instead.

trutzig89182 commented 2 years ago

Ah, sorry, I must have missed that, as I expected the test to fail due to the missing tokenization. I fiddled around with the test_sentences a bit in the process. So yes, 1 "dolor" is correct.

thomjur commented 2 years ago

I have now started implementing "real" unit testing... at least, I have implemented the corresponding framework. :D If you want to read more: the unittest documentation (https://docs.python.org/3/library/unittest.html)

The unit tests can be started via `python test.py -v`; the `-v` flag just makes the output more verbose, and the command also works without it. Let me know what you think!
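
For anyone reading along, that framework boils down to a skeleton like the following (a sketch, not the actual file; the `collocation` import and the return shape asserted on are assumptions):

```python
# test.py -- minimal unittest skeleton (sketch with assumed names)
import unittest

from analysis import collocation  # assumed import; adjust to the real module layout


class TestCollocation(unittest.TestCase):

    def test_right_collocates(self):
        # Hand-checkable toy input: "dolores" appears once to the right of "dolor".
        result = collocation("lorem ipsum dolor sit amet dolores", "dolor")
        self.assertEqual(result["dolores"], 1)  # assumed return shape


if __name__ == "__main__":
    unittest.main()  # `python test.py -v` switches on verbose output
```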

trutzig89182 commented 2 years ago

Ah, we have been working on this at the same time; I just wondered about the merge conflict. Our files are almost identical. I did not use a setUp function to provide the test sentences, but I guess it is more elegant, especially if we want to provide other data or define more properties for testing in the future.
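
For reference, the setUp() pattern looks roughly like this (a sketch; the actual test_sentences in the repo differ):

```python
import unittest


class TestCollocation(unittest.TestCase):

    def setUp(self):
        # setUp() runs before every individual test method, so each test
        # starts from a fresh copy of the shared fixture data.
        self.test_sentences = [
            "lorem ipsum dolor sit amet",
            "dolor dolores ipsum",
        ]
```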

As you already mentioned, I think that test_tabular_output is not a test in the unittest sense. So I would probably delete it from test.py and rather experiment with it in the analysis file.

One thing I would suggest is also working with a main function in analysis.py. That would separate the structured/visual output you get when calling the file directly from programmatic access to the functions, which should return the raw data without printing anything to the shell, roughly like the sketch below.
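
A minimal sketch of that pattern (names are illustrative, not the repo's actual code):

```python
# analysis.py (sketch)

def collocation(text, search_term):
    """Return raw collocation data without printing anything."""
    return {}  # placeholder; the actual implementation lives in the repo


def main():
    # Structured/visual output happens only here, when the file is run directly.
    results = collocation("lorem ipsum dolor sit amet", "dolor")
    print(results)


if __name__ == "__main__":
    main()  # not executed when analysis.py is imported, e.g. by test.py
```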

thomjur commented 2 years ago

@trutzig89182 Regarding your question about docstrings/multiline comments: they are used to document the code and, among other things, are shown when calling help(function). They are also parsed by documentation generators such as Sphinx. See here: Documenting Python code

/EDIT: I am not using them correctly either, but it might be good to have them in place already, even if they are not yet well formatted.
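
A quick illustration of the help() behavior (hypothetical function, just to show the mechanics):

```python
def collocation(text, search_term):
    """Count the words co-occurring with search_term in text.

    This docstring is what help(collocation) prints and what
    documentation generators such as Sphinx pick up.
    """
    ...


help(collocation)  # prints the signature plus the docstring above
```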

trutzig89182 commented 2 years ago

Thanks, that was completely new to me.

trutzig89182 commented 2 years ago

FYI: I started working a bit on automated tests via GitHub, but have not gotten them to work properly yet.

https://github.com/trutzig89182/PyCollocation/blob/main/.github/workflows/python-tests.yml

Will try to put in some more work on the core functionalities this weekend.

thomjur commented 2 years ago

Cool, I will have a look at that asap! I was quite busy these past few days, but I hope to be able to work some more on our project this weekend! I plan to continue working on the results output.

trutzig89182 commented 2 years ago

Having said that, I just found the solution. You had it in the README: punkt must be downloaded via nltk, and this had to be added to the automation file separately; including nltk in the requirements was not enough. I will make a PR but not merge it, so you can have a look, as the test will run via your GitHub account (I don't suppose this is a problem, but anyhow). As this is not urgent, take your time.

thomjur commented 2 years ago

@trutzig89182 Looks interesting, I have never done any automated testing like this. Do you have a link or something for further reading? Or can you just explain what you did? Thanks!

trutzig89182 commented 2 years ago

Sure. Basically, the yml file in .github/workflows defines the conditions (on push / on pull request); when these conditions are met, GitHub spins up a runner that works like a shell session, installs the dependencies, and then runs the test file. So every time you make a pull request or push something to the repo, the tests defined in test.py are executed, and you get a warning if they fail. One of the texts I read for this: https://blog.deepjyoti30.dev/tests-github-python
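
For the record, such a workflow file boils down to something like this (a sketch, not the exact contents of python-tests.yml; the nltk download step reflects the punkt fix mentioned above):

```yaml
# .github/workflows/python-tests.yml (sketch)
name: Python tests

on: [push, pull_request]  # the conditions mentioned above

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.10"
      - run: pip install -r requirements.txt
      - run: python -m nltk.downloader punkt  # punkt is not pulled in via requirements.txt
      - run: python test.py -v
```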

thomjur commented 2 years ago

Wow... just had a quick look at it. This is so cool and useful, thanks!

trutzig89182 commented 2 years ago

We could perhaps add a test that just checks whether get_results_collocates() executes without error, along these lines:

def test_get_results_collocates(self):
    '''
    Test that get_results_collocates() in display.py executes without an error.
    '''
    # NOTE: the arguments below are placeholders; we still need to decide
    # where the fixed test data for them should come from.
    self.assertTrue(get_results_collocates(left_counter, right_counter, total_word_counter, search_term_counter, l_window, r_window, statistic, output_type, k=3))

Just dropping this here because I have to do other stuff now, and it is nothing to push yet. We still need to decide whether to add fixed data for the variables passed to get_results_collocates() or to get the data from one of our functions. I am not sure what best practice is, or how best to deal with the fact that display.get_results_collocates() is called in start_collocation_analysis(), so using that for testing get_results_collocates() would be a bit circular.
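
One way out of that circularity might be to build the arguments from fixed, hand-counted data instead of calling our own pipeline. A minimal sketch, assuming get_results_collocates() accepts Counter objects for the counts (the import path, window values, and the "frequency"/"dict" option values are all assumptions):

```python
import unittest
from collections import Counter

from display import get_results_collocates  # assumed import path


class TestDisplay(unittest.TestCase):

    def setUp(self):
        # Fixed stand-in data, so the test does not depend on
        # start_collocation_analysis() to produce its own input.
        self.left_counter = Counter({"ipsum": 1})
        self.right_counter = Counter({"dolores": 1})
        self.total_word_counter = Counter({"ipsum": 2, "dolor": 2, "dolores": 1})
        self.search_term_counter = 2

    def test_get_results_collocates(self):
        # Positional arguments mirror the call in the snippet above;
        # the option values "frequency" and "dict" are hypothetical.
        result = get_results_collocates(
            self.left_counter, self.right_counter, self.total_word_counter,
            self.search_term_counter, 3, 3, "frequency", "dict", k=3,
        )
        self.assertTrue(result)


if __name__ == "__main__":
    unittest.main()
```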

thomjur commented 2 years ago

Also just a short note: I stumbled upon the interesting doctest module in a different context. I don't think it makes sense to use it in our case, since we have to set up the data etc., but it's an interesting approach to code the tests right into the docstrings... https://docs.python.org/3/library/doctest.html
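
For illustration, a doctest lives right in the docstring and is checked by the doctest module (a minimal self-contained example, unrelated to our code):

```python
def add(a, b):
    """Return the sum of a and b.

    >>> add(2, 3)
    5
    """
    return a + b


if __name__ == "__main__":
    import doctest
    doctest.testmod()  # runs the >>> examples found in the module's docstrings
```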