readbeyond / aeneas

aeneas is a Python/C library and a set of tools to automagically synchronize audio and text (aka forced alignment)
http://www.readbeyond.it/aeneas/
GNU Affero General Public License v3.0

Feature Request: Add a type field in the mplain json output #183

Open johnking opened 7 years ago

johnking commented 7 years ago

Hi @readbeyond

To get the type (paragraph, sentence, or word) of a fragment from the sync map JSON produced for multilevel plain (mplain) text, we have to parse the id field, e.g. "p000014s000001w000002".

It would be nice to have one more field, type, in the JSON data to avoid such post-processing.
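For reference, the post-processing in question could look roughly like the sketch below. It assumes that ids are always built as a paragraph part, optionally followed by a sentence part and a word part, as in the example id above; the function name `fragment_type` and the returned labels are hypothetical, not part of aeneas.

```python
import re

# Assumed id layout, inferred from the example id in this issue
# (not taken from the aeneas documentation):
#   p000014                -> paragraph
#   p000014s000001         -> sentence
#   p000014s000001w000002  -> word
ID_PATTERN = re.compile(r"^p\d+(?P<sentence>s\d+(?P<word>w\d+)?)?$")


def fragment_type(identifier):
    """Return "paragraph", "sentence" or "word" for a fragment id (hypothetical helper)."""
    match = ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError("unexpected fragment id: %r" % identifier)
    if match.group("word"):
        return "word"
    if match.group("sentence"):
        return "sentence"
    return "paragraph"
```

Calling `fragment_type("p000014s000001w000002")` classifies the example id as a word. Ids that do not follow the assumed layout raise a `ValueError` instead of being silently misclassified.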

If this does not make sense for this repository, could you please give me some hints on how to modify the code myself?

thanks a lot

-John

readbeyond commented 7 years ago

Hi,

you raise a good point. In fact, I am not really satisfied with how multi-level formats are supported right now. On one hand, the "compute engine" is generic enough to support an arbitrary number of levels; on the other hand, the global parameters and command line options support exactly three levels. Moreover, this support is implemented by replicating keys/names/variables.

For all these reasons, I will probably rework the relevant code for multi-level formats in aeneas v2, and while doing that I will address your issue directly.

Unfortunately, this plan also means that I am not going to address your issue in the 1.x series, so you need to either process the id after the JSON file has been produced, or patch your local version of aeneas. In the latter case, you might want to modify

def format(self, syncmap)

in

https://github.com/readbeyond/aeneas/blob/master/aeneas/syncmap/smfjson.py#L53

or, even better, the code of

@property
def json_string(self)

in

https://github.com/readbeyond/aeneas/blob/master/aeneas/syncmap/__init__.py#L248

(you need to keep track of the level in the recursive visit, and add a suitable "type": "value" entry to the dictionary that is appended at line 262)
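The same level tracking can also be applied to a JSON file that has already been produced, without patching aeneas. The following is a minimal sketch, assuming the output is a dictionary with a "fragments" list whose items carry a nested "children" list; `add_types` and the labels in `LEVEL_NAMES` are hypothetical names chosen for illustration.

```python
# Hypothetical labels for the three mplain levels, outermost first.
LEVEL_NAMES = ("paragraph", "sentence", "word")


def add_types(fragments, level=0):
    """Recursively add a "type" key to each fragment, based on its depth."""
    for fragment in fragments:
        if level < len(LEVEL_NAMES):
            fragment["type"] = LEVEL_NAMES[level]
        else:
            # Deeper than the assumed three levels: fall back to a generic label.
            fragment["type"] = "level%d" % level
        add_types(fragment.get("children", []), level + 1)
    return fragments


# Tiny hand-written stand-in for a sync map loaded with json.load().
sync_map = {
    "fragments": [
        {"id": "p000001", "children": [
            {"id": "p000001s000001", "children": [
                {"id": "p000001s000001w000001", "children": []},
            ]},
        ]},
    ]
}
add_types(sync_map["fragments"])
```

The fragments are modified in place, so the annotated structure can be written back with `json.dump` afterwards.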

HTH,

Alberto Pettarin

johnking commented 7 years ago

@readbeyond Hi Alberto,

Thanks for your reply and for sharing the roadmap with us. Looking forward to v2.0!

thanks again!

-John

pettarin commented 7 years ago

@johnking hi, you might want something like this:


    @property
    def json_string(self):
        """
        Return a JSON representation of the sync map.
        :rtype: string
        .. versionadded:: 1.3.1
        """
        def visit_children(node, level):
            """ Recursively visit the fragments_tree """
            output_fragments = []
            for child in node.children_not_empty:
                fragment = child.value
                text = fragment.text_fragment
                output_fragments.append({
                    "id": text.identifier,
                    "language": text.language,
                    "lines": text.lines,
                    "begin": gf.time_to_ssmmm(fragment.begin),
                    "end": gf.time_to_ssmmm(fragment.end),
                    "children": visit_children(child, level + 1),
                    "type": level  # depth in the tree: 0 for top-level fragments
                })
            return output_fragments
        output_fragments = visit_children(self.fragments_tree, 0)
        return gf.safe_unicode(
            json.dumps({"fragments": output_fragments}, indent=1, sort_keys=True)
        )

johnking commented 7 years ago

@readbeyond Hi Alberto, thanks for sharing, I really appreciate it.

I am developing an app and want to reuse/expand the JSON structure. I will share my idea once I finish the prototype.

thanks again.

-John