harmonydata / harmony

The Harmony Python library: a research tool for psychologists to harmonise data and questionnaire items. Open source.
https://harmonydata.ac.uk
MIT License

Allow batching of items when sent to LLM #56

Closed: woodthom2 closed this issue 1 week ago

woodthom2 commented 1 month ago

Description

Can we modify convert_texts_to_vector in https://github.com/harmonydata/harmony/blob/main/src/harmony/matching/default_matcher.py to allow items to be batched when sent to the LLM?

Batch size should be variable

Rationale

If a user wants to harmonise 10,000 items, this will not fit in memory even on a high-performance machine. Small laptops can probably only handle batches of about 20 items at a time. The batch size should be configurable, perhaps as a parameter, because batching will slow things down.
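A minimal sketch of the batching idea, not the actual Harmony implementation: the helper `encode_in_batches`, its `encode_fn` argument and the `toy_encoder` stand-in are all hypothetical names introduced for illustration. It only shows how a list of texts could be fed to an embedding function in chunks of a configurable size, so that no more than `batch_size` items are passed to the model at once.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def encode_in_batches(
    texts: Sequence[str],
    encode_fn: Callable[[List[str]], List[Vector]],
    batch_size: int = 20,  # assumption: default taken from the "20 items" guess above
) -> List[Vector]:
    """Call encode_fn on chunks of at most batch_size texts and join the results."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    vectors: List[Vector] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        # Only this batch is handed to the model at any one time.
        vectors.extend(encode_fn(batch))
    return vectors


def toy_encoder(batch: List[str]) -> List[Vector]:
    """Hypothetical stand-in for a real embedding model (not a Harmony function)."""
    return [[float(len(text)), float(len(text.split()))] for text in batch]


items = [f"questionnaire item {i}" for i in range(45)]
embeddings = encode_in_batches(items, toy_encoder, batch_size=20)
print(len(embeddings))
```

The output order matches the input order regardless of `batch_size`, so a caller that currently passes all texts in one go could, under these assumptions, switch to the batched version without changing how it reads the results.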

People have reported that the website cannot cope with large harmonisations, e.g. the comment below from Discord (23 Oct 2024):

image

makrianast commented 3 weeks ago

@woodthom2 Hello. If this issue is still open, I would love to work on it and contribute to your project.

woodthom2 commented 3 weeks ago

Hi @makrianast, please feel free to take this on! Thanks so much! Do you want to have a quick chat with me on Discord/Google Meet about it?

woodthom2 commented 3 weeks ago

Just FYI, the server that runs the Harmony web tool has 16 GB of memory. I have not tested at what size a request crashes the server, but I am pretty certain that the critical number is between 50 and 2000 questionnaire items! Of course, we have to allow for different user machine specs.
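One way to allow for different machine specs, sketched here under assumptions: the batch size is read from an environment variable, with a fallback default. The variable name `HARMONY_BATCH_SIZE`, the helper `get_batch_size` and the default of 20 are hypothetical and are not confirmed anywhere in this thread.

```python
import os

# Assumption: 20 is only the rough laptop figure mentioned in the issue description.
DEFAULT_BATCH_SIZE = 20


def get_batch_size(
    env_var: str = "HARMONY_BATCH_SIZE",  # hypothetical variable name
    default: int = DEFAULT_BATCH_SIZE,
) -> int:
    """Return a positive batch size from the environment, or the default."""
    raw = os.environ.get(env_var)
    if raw is None or raw.strip() == "":
        return default
    try:
        value = int(raw)
    except ValueError:
        # Unparseable values fall back to the default instead of crashing.
        return default
    return value if value > 0 else default


print(get_batch_size())
```

A 16 GB server and a small laptop could then run the same code with different settings, and the right value for each would still need to be found by testing.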

makrianast commented 3 weeks ago

Hello @woodthom2. Yes, of course. My Discord is: anastasiamakrii. Feel free to contact me there if you'd like!

woodthom2 commented 3 weeks ago

Thanks!

woodthom2 commented 2 weeks ago

Also related to https://github.com/harmonydata/harmony/issues/63