Open SirRob1997 opened 1 year ago
@mjpost @rewicks let me know if you want this integrated. I currently have it implemented in a private fork, but I can open a PR.
I agree that the segmentation should be invertible. Maybe this could be added with a command-line option. I also think the fields should be tab-delimited.
Often, users may want to apply this sentence segmenter within larger pipelines. One use case is segmenting document-level data into sentence-level segments that sentence-level machine translation systems can translate easily:
The problem here is that after sentence segmentation we would get something like:
Currently, there is no option to also get the mapping back to the original sequence without splitting requests into per-line requests, which is very inefficient. This mapping becomes essential if users want to reconstruct the original sequence, e.g. after translating. Ideally, there should be an option to additionally print the original sequence ID, either in the same file or in a separate file, e.g. something like:
I believe it should be relatively straightforward to implement: simply `enumerate` here https://github.com/rewicks/ersatz/blob/e5ed3ebbc64ac5993093ee42bca3a282d45e556e/ersatz/split.py#L169 and write into either a separate file or into the same file in a TSV fashion for each line that gets written already.
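
To make the proposal concrete, here is a minimal sketch of the idea in Python. It does not use ersatz's actual code or API: `naive_split` is a hypothetical regex-based stand-in for the real segmenter, and `segment_with_ids`, `to_tsv`, and `reconstruct` are illustrative names, not existing functions. It assumes the same-file variant, with the original line index written as the first tab-delimited field.

```python
import re
from itertools import groupby


def naive_split(line):
    """Hypothetical stand-in for the real segmenter: split after ., ! or ?"""
    return [s for s in re.split(r"(?<=[.!?])\s+", line.strip()) if s]


def segment_with_ids(lines, split=naive_split):
    """Yield (original_line_index, sentence) pairs for every input line."""
    for idx, line in enumerate(lines):
        # Emit one empty segment for empty lines so no index goes missing.
        for sentence in split(line) or [""]:
            yield idx, sentence


def to_tsv(pairs):
    """Format pairs as 'index<TAB>sentence' rows."""
    return [f"{idx}\t{sentence}" for idx, sentence in pairs]


def reconstruct(tsv_rows):
    """Rejoin consecutive rows that share the same original index."""
    rows = (row.split("\t", 1) for row in tsv_rows)
    return [
        " ".join(sentence for _, sentence in group)
        for _, group in groupby(rows, key=lambda row: row[0])
    ]
```

Passing a list of documents through `to_tsv(segment_with_ids(docs))` gives rows that `reconstruct` can regroup after each sentence has been processed (e.g. translated), as long as row order is preserved. Note that this sketch rejoins sentences with a single space, so it is only truly invertible if the original whitespace is stored as well.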