goodmami / penman

PENMAN notation (e.g. AMR) in Python
https://penman.readthedocs.io/
MIT License
135 stars 26 forks source link

Cross-platform/encoding support for IO operations #109

Closed BramVanroy closed 1 year ago

BramVanroy commented 1 year ago

I tend to run my dev code locally on my powerful Windows-based home set-up before pushing to the cluster. In this instance, I am trying to run spring which relies on penman. It makes some calls to penman's load method. I am running their preprocessing to parse AMR data and found calls to penman._load result in encoding issues.

I am aware I can just run on WSL and be done with it, but I'd rather see this useful tool be available cross-platform. Is there any way I can contribute? Which methods are all reliant on encoding? My approach would be to allow for an optional encoding argument, as is common in enc/dec methods, and pass it through the relevant IO functions like open.

For starters I can start with codec if that is okay, e.g., change the _load function to

def _load(source: FileOrFilename,
          model: Model = None,
          encoding: str = None) -> List[Graph]:  # EDITED
    """
    Deserialize a list of PENMAN-encoded graphs from *source*.

    Args:
        source: a filename or file-like object to read from
        model: the model used for interpreting the graph
    Returns:
        a list of Graph objects
    """
    codec = PENMANCodec(model=model)
    if isinstance(source, (str, Path)):
        with open(source, encoding=encoding) as fh:  # EDITED
            return list(codec.iterdecode(fh))
    else:
        assert hasattr(source, 'read')
        return list(codec.iterdecode(source))
goodmami commented 1 year ago

Thanks for bringing that up. I don't have a Windows machine to test on, but I recall that Windows has strange, sometimes non-standard default encodings, such as UTF-8 with a byte-order-mark. I agree that allowing an encoding option would be helpful. I think you'd only need to target where text is read from the filesystem, as everything should be native Python (unicode) strings internally, so look for calls to open(). I see the following:

https://github.com/goodmami/penman/blob/4191a6d604d5827cbff3545bcc006d90ff0930ed/penman/codec.py#L256

https://github.com/goodmami/penman/blob/4191a6d604d5827cbff3545bcc006d90ff0930ed/penman/codec.py#L295

https://github.com/goodmami/penman/blob/4191a6d604d5827cbff3545bcc006d90ff0930ed/penman/__main__.py#L267

There are a couple other places open() is called, but they seem less urgent (e.g., in setup.py or the unit tests).

If you want to submit a PR with the necessary changes, that would be great.

BramVanroy commented 1 year ago

Thank you for the quick response! Indeed, unfortunately Windows still relies on cp1252 encoding (not uf-8-bom), which always leads to small issues like this. I'll look for built-in "open" methods and allow for the option of custom encoding there. I'll send in a PR soon.