thomjur / PyCollocation

Python module to do simple collocation analysis of a corpus.
GNU General Public License v3.0

Feature: Merged Results #11

Closed thomjur closed 2 years ago

thomjur commented 2 years ago

@trutzig89182 I plan to work on a combined result later today or tomorrow (so that we don't both end up working on the same feature again^^).

My idea is the following: I think it is generally common to combine the results from the left and right collocation analysis and to only indicate on which side the term appears most frequently.

I therefore plan to combine left_counter and right_counter dicts to generate a single dict. While doing this, I also plan to add some more details to the pandas output. It should look like this:
idx  word   coll_frequency  orient  total_freq
1    dolor  4               left    8
2    etiam  2               -       10
3    lares  9               right   9

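A minimal sketch of the merge described above, not the implementation that ended up in the repository. It assumes that coll_frequency is the sum of the left and right counts, that "-" marks a tie between the two sides, and that total_freq comes from a separate corpus-wide count. The names merge_counters and total_counter are hypothetical.

```python
from collections import Counter


def merge_counters(left_counter, right_counter, total_counter):
    """Combine left/right collocate counts into one row per word.

    Assumptions (not confirmed in this thread): coll_frequency is
    left + right, '-' means both sides are equally frequent, and
    total_counter holds the overall corpus frequency of each word.
    """
    combined = Counter(left_counter) + Counter(right_counter)
    rows = []
    # most_common() yields the words sorted by combined frequency.
    for word, coll_frequency in combined.most_common():
        left = left_counter.get(word, 0)
        right = right_counter.get(word, 0)
        if left > right:
            orient = "left"
        elif right > left:
            orient = "right"
        else:
            orient = "-"
        rows.append(
            {
                "word": word,
                "coll_frequency": coll_frequency,
                "orient": orient,
                "total_freq": total_counter.get(word, 0),
            }
        )
    return rows


if __name__ == "__main__":
    left = Counter({"dolor": 3, "etiam": 1})
    right = Counter({"dolor": 1, "etiam": 1, "lares": 9})
    totals = {"dolor": 8, "etiam": 10, "lares": 9}
    for row in merge_counters(left, right, totals):
        print(row)
```

The returned list of dicts can be passed directly to pandas.DataFrame(rows) to get the table layout shown above.
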
trutzig89182 commented 2 years ago

I think that's a great idea. I tried to use the collocations function with my jsonl data this weekend and thought about what output would be good. It will probably be useful to add an option for CSV output of the results later, too.

I think it is not that bad if we work on something in parallel. I would not have understood what you did if I hadn’t been working on the tests before. And in the end it can be good to compare solutions and pick the better option. And I think with the extended code it will be less likely that we work on exactly the same lines.

But anyway, I will not work on the results/output question today.

thomjur commented 2 years ago

That is true! The great advantage of using pandas is that we can easily convert the dataframes to whatever format we like. So, instead of printing the frame, we could also call to_csv() or to_markdown() and return the corresponding format. In that case, though, we should rename the function, since it would no longer be a mere "display" function. I think it would also be important to add some statistical measures such as log-likelihood soon. Maybe we can create a subfolder statistics or something, but this should be a separate feature/issue.
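The format switch could look roughly like the sketch below. The function name format_results and the fmt parameter are hypothetical, not the names used in the repository. DataFrame.to_string(), to_csv() and to_markdown() are regular pandas methods; to_markdown() additionally needs the optional tabulate package.

```python
import pandas as pd


def format_results(df, fmt="string"):
    """Return the results frame as text in the requested format.

    Hypothetical helper: 'string' mimics the current shell output,
    the other formats only call the matching pandas method.
    """
    if fmt == "string":
        return df.to_string(index=False)
    if fmt == "csv":
        # Without a path argument, to_csv() returns the CSV as a string.
        return df.to_csv(index=False)
    if fmt == "markdown":
        # Requires the optional 'tabulate' dependency to be installed.
        return df.to_markdown(index=False)
    raise ValueError(f"unknown format: {fmt!r}")


if __name__ == "__main__":
    frame = pd.DataFrame(
        {"word": ["dolor", "lares"], "coll_frequency": [4, 9]}
    )
    print(format_results(frame))
    print(format_results(frame, "csv"))
```
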

trutzig89182 commented 2 years ago

Would it make sense to call it results or return_results? Then shell output could be the default and other options added.

But this probably depends on what we want the interface to the package to be in the end. Do we want collections, analyses, display/results to be accessed directly, or do we funnel everything through one command that gives access to the different functions and offers different ways of receiving the data?

thomjur commented 2 years ago

Good question... I personally think it would make sense to pipe everything, maybe even offering to start the program from the command line. I mean, we have one particular functionality. But what do you think?
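As a rough illustration of the command-line idea, here is a sketch of a single entry point built with argparse from the standard library. Every argument name (corpus, search_term, --left, --right, --format), the default values and the program name are assumptions made for the example, not an agreed interface.

```python
import argparse


def build_parser():
    """Build a parser for a hypothetical single-command interface."""
    parser = argparse.ArgumentParser(
        prog="pycollocation",
        description="Simple collocation analysis of a corpus.",
    )
    parser.add_argument("corpus", help="path to the corpus file")
    parser.add_argument("search_term", help="word to find collocates for")
    parser.add_argument("--left", type=int, default=3,
                        help="window size to the left of the search term")
    parser.add_argument("--right", type=int, default=3,
                        help="window size to the right of the search term")
    parser.add_argument("--format", choices=["string", "csv", "markdown"],
                        default="string", help="output format")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    # Here the collection, analysis and output steps would be chained.
    print(args)
```
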

thomjur commented 2 years ago

Okay, I did not have much time today, but I have implemented a basic version that can still be started via test.py for the moment (need to change that). It only prints the dataframe, but we can work on that, too. Also, I thought we should soon add more sophisticated test data. I will continue working on this tomorrow.