jacksonllee / pylangacq

Language Acquisition Research Tools
https://pylangacq.org
MIT License

Reading in chat data direct from memory #1

Closed bsherin closed 3 years ago

bsherin commented 6 years ago

Hi, I'm wondering if there's a way to take a chat file that exists as a string in memory, and to load it directly into a Reader instance. Thanks for any help.

jacksonllee commented 6 years ago

Hi, although the latest v0.10.0 doesn't offer a way to do this, I'm considering adding a class method from_chat_str to Reader based on your question, so that you can do something like this:

import pylangacq as pla
reader = pla.Reader.from_chat_str(chat_data_str, encoding='utf-8')

where chat_data_str is your CHAT data as an in-memory string (a string of what a single CHAT data file would be).

The current master branch on GitHub has just been updated with this class method. You may try it out by installing this dev version of pylangacq:

pip install git+https://github.com/pylangacq/pylangacq.git

Would you be able to let me know if this is what would work for your use case? If so, I'll make a release on PyPI so that this class method will be more readily available with pip install pylangacq. Thanks!

bsherin commented 6 years ago

Thanks! I’ll take a look, probably tomorrow afternoon. If you’re interested, the reason I’m looking for this capability is that I’m building a web-based system for social science data analysis that makes libraries available to researchers. Users/coders that use the system access data from an API, rather than a file system.

Here's a link: https://tactic.readthedocs.io/en/latest/index.html

bsherin commented 6 years ago

That worked like a charm! The only thing that's missing is a way to load the equivalent of multiple strings, corresponding to multiple chat files. Something like the equivalent of the add command would do the trick.

Some other thoughts for the future: I see that you made this work by writing a temporary file to the local file system. That made it possible for you to make this change with only a small addition. On my system, however, the code that runs in a user's data analysis isn't really supposed to write anything to the file system. (It's a long story. Your new code still works in my system; there are just some small limitations.) When I originally looked at your code, I was looking to see if there was a way to keep the files around as something like StringIO instances. But it looked like that would require a lot of changes, distributed throughout the code.

So, for my very specific case, it would be a little helpful to not have the data written to the file system. But I'm not sure that other users would have a need. And the current version does 99% of what I need. Thanks again.

jacksonllee commented 6 years ago

> The only thing that's missing is a way to load the equivalent of multiple strings, corresponding to multiple chat files.

One workaround would be to create an empty Reader instance and then add to it the individual Reader objects, each instantiated from one CHAT string. Something like this (not tested):

import pylangacq as pla

master_reader = pla.Reader()

for chat_str in chat_strs:  # chat_strs is your container of CHAT strings
    reader = pla.Reader.from_chat_str(chat_str)
    master_reader.add(reader)

> When I originally looked at your code, I was looking to see if there was a way to keep the files around as something like StringIO instances.

StringIO did cross my mind yesterday evening when I tried to implement from_chat_str, but I went down the temp file path instead to get something operational real quick for you to test out. Using StringIO should be possible -- the extra work would just be a bit of refactoring. Stay tuned for another update!
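
To illustrate the StringIO idea in general terms, here is a minimal sketch. It is not pylangacq's code: parse_utterances is a hypothetical helper and the sample transcript is made up. The point is only that a parser written against "any iterable of lines" accepts an in-memory io.StringIO just as it would an open file, so no temporary file is needed.

```python
import io
from typing import Iterable, List, Tuple


def parse_utterances(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Collect (speaker, utterance) pairs from the main tiers of CHAT-style lines.

    Hypothetical helper for illustration only; it is not pylangacq's parser.
    It skips headers (@...), dependent tiers (%...), and continuation lines.
    """
    utterances = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("*") and ":" in line:
            speaker, _, text = line[1:].partition(":")
            utterances.append((speaker, text.strip()))
    return utterances


# Made-up sample data standing in for a CHAT file held as a string in memory.
chat_data_str = (
    "@UTF8\n"
    "@Begin\n"
    "@Participants:\tCHI Target_Child, MOT Mother\n"
    "*CHI:\tmore cookie .\n"
    "%mor:\tqn|more n|cookie .\n"
    "*MOT:\tyou want more ?\n"
    "@End\n"
)

# io.StringIO wraps the string in a file-like object, so nothing touches the disk.
with io.StringIO(chat_data_str) as in_memory_file:
    utterances = parse_utterances(in_memory_file)

print(utterances)
```

Because the helper only iterates over lines, the same function would also accept an object returned by open(path), which is roughly the kind of refactoring described above.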

bsherin commented 6 years ago

That workaround sounds pretty darn easy. I'll give it a try. Thanks!

bsherin commented 5 years ago

A minor update: I just got around to trying the workaround suggested above. I think the last line needs to be master_reader.update(reader) rather than master_reader.add(reader).

jacksonllee commented 3 years ago

Apologies for the long silence. I've just released v0.13.0. I ended up rewriting the whole Reader class, with a fair amount of breaking changes (changelog). The Reader classmethod from_strs reads CHAT data strings without hitting the disk, documentation here. I'm closing this issue as resolved. Please let me know if you have any questions.
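
For anyone arriving at this issue later, a minimal sketch of the from_strs route, assuming pylangacq v0.13.0 or newer is installed. The helper name read_chat_strings and the sample transcripts are made up for illustration, and the ids keyword reflects my reading of the v0.13 documentation, so please check it against the version you have.

```python
from typing import List, Optional

# Two tiny CHAT transcripts held entirely in memory (made-up sample data).
CHAT_STRS = [
    "@UTF8\n@Begin\n@Languages:\teng\n"
    "@Participants:\tCHI Target_Child\n"
    "*CHI:\tmore cookie .\n@End\n",
    "@UTF8\n@Begin\n@Languages:\teng\n"
    "@Participants:\tMOT Mother\n"
    "*MOT:\tyou want more ?\n@End\n",
]


def read_chat_strings(chat_strs: List[str], ids: Optional[List[str]] = None):
    """Build a single Reader from in-memory CHAT strings.

    Hypothetical convenience wrapper; assumes pylangacq >= 0.13.0, where the
    Reader.from_strs classmethod exists. The import is done lazily so that the
    sample data above can be inspected without pylangacq installed.
    """
    import pylangacq

    # ids (assumed from the v0.13 docs) gives each string a label that plays
    # the role a file path would otherwise play.
    return pylangacq.Reader.from_strs(chat_strs, ids=ids)
```

Calling read_chat_strings(CHAT_STRS, ids=["session1", "session2"]) should then give one Reader covering both transcripts, which replaces the earlier loop over from_chat_str.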