rewicks / ersatz

Apache License 2.0

Add option to write reversible sequence mapping #11

Open SirRob1997 opened 1 year ago

SirRob1997 commented 1 year ago

Users often want to apply this sentence segmenter within larger pipelines. One use case is segmenting document-level data into sentence-level segments that sentence-level machine translation systems can translate easily. Take an input such as:

This is a sample Text. We have multiple sentences.
Sample text 2. With multiple sentences =)

The problem here is that after sentence segmentation we would get something like:

This is a sample Text.
We have multiple sentences.
Sample text 2.
With multiple sentences =)

Currently, there is no option to also get the mapping back to the original sequences, short of splitting the input into per-line requests, which is very inefficient. This mapping becomes essential if users want to reconstruct the original sequences, e.g. after translating. Ideally, there should be an option to additionally print the original sequence ID, either in the same file or in a separate file, e.g. something like:

0 This is a sample Text.
0 We have multiple sentences.
1 Sample text 2.
1 With multiple sentences =)

I believe this should be relatively straightforward to implement: simply enumerate here https://github.com/rewicks/ersatz/blob/e5ed3ebbc64ac5993093ee42bca3a282d45e556e/ersatz/split.py#L169 and, for each line that already gets written, write the index either into a separate file or into the same file in TSV fashion.
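
To make the proposal concrete, here is a minimal sketch of the enumerate-and-tag idea. It does not use ersatz's actual code or API: `naive_segment` is a hypothetical regex-based stand-in for the real model, and `segment_with_ids` and `to_tsv` are illustrative names only.

```python
import re
from typing import Callable, Iterable, Iterator, List, Tuple


def naive_segment(line: str) -> List[str]:
    """Placeholder splitter: break after '.', '!' or '?' followed by whitespace.

    Hypothetical stand-in for the real segmenter; not part of ersatz.
    """
    return [s for s in re.split(r"(?<=[.!?])\s+", line.strip()) if s]


def segment_with_ids(
    lines: Iterable[str],
    segment: Callable[[str], List[str]] = naive_segment,
) -> Iterator[Tuple[int, str]]:
    """Yield (input_line_index, sentence) for every sentence produced."""
    for idx, line in enumerate(lines):
        for sentence in segment(line):
            yield idx, sentence


def to_tsv(pairs: Iterable[Tuple[int, str]]) -> str:
    """Render the pairs as tab-delimited 'id<TAB>sentence' rows."""
    return "".join(f"{idx}\t{sentence}\n" for idx, sentence in pairs)


if __name__ == "__main__":
    docs = [
        "This is a sample Text. We have multiple sentences.",
        "Sample text 2. With multiple sentences =)",
    ]
    print(to_tsv(segment_with_ids(docs)), end="")
```

One caveat with this sketch: an empty input line produces no rows at all, so a consumer has to treat a missing ID as an empty sequence.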

SirRob1997 commented 1 year ago

@mjpost @rewicks let me know if you want this integrated. I currently have it implemented in a private fork, and I can open a PR.

mjpost commented 1 year ago

I agree that the segmentation should be invertible. Maybe this could be added with a command-line option. I also think the fields should be tab-delimited.
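
For illustration, inverting a tab-delimited output could look roughly like the sketch below. It assumes each row has the form `id<TAB>sentence` with zero-based integer IDs; the function name `merge_segments` and the single-space joiner are hypothetical choices, not something ersatz provides.

```python
from typing import Dict, Iterable, List


def merge_segments(tsv_lines: Iterable[str], joiner: str = " ") -> List[str]:
    """Rebuild one line per original sequence ID from 'id<TAB>sentence' rows."""
    grouped: Dict[int, List[str]] = {}
    for row in tsv_lines:
        row = row.rstrip("\n")
        if not row:
            continue  # skip blank rows
        # Split on the first tab only, so tabs inside a sentence survive.
        idx_str, sentence = row.split("\t", 1)
        grouped.setdefault(int(idx_str), []).append(sentence)
    if not grouped:
        return []
    # IDs with no rows (e.g. empty input lines) come back as empty strings.
    return [joiner.join(grouped.get(i, [])) for i in range(max(grouped) + 1)]
```

Joining with a single space restores the original text only when the sentences were separated by exactly one space. A fully invertible format would also need to record the original whitespace between segments.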