idio / json-wikipedia

Json Wikipedia, contains code to convert the Wikipedia xml dump into a json dump. Questions? https://gitter.im/idio-opensource/Lobby

Distinguish bold and italic in highlights #50

Closed thomasopsomer closed 5 years ago

thomasopsomer commented 7 years ago

Hi,

I want to be able to filter bold or italic text among the highlights extracted by json-wikipedia. I'm planning to address that; is this something you're also interested in?

tgalery commented 7 years ago

Hi @thomasopsomer, yeah, we are definitely interested in that. It could be that some of the dependencies of json-wikipedia, like dkpro, bliki or sweble, already have ways to do so. Let me know if you need help getting this started.

thomasopsomer commented 7 years ago

Actually the setHighlights method of the ArticleParser shows how to get italic and bold mentions (https://github.com/idio/json-wikipedia/blob/development/src/main/java/it/cnr/isti/hpc/wikipedia/parser/ArticleParser.java#L411).

I'm wondering if it's better to create two fields, BoldHighlights and ItalicHighlights, or to make highlights a list of custom objects with a type field indicating whether each one is bold or italic. I would also like to keep the information about which paragraph each highlight comes from, so either a list of lists or a paragraphId field. Do you have any recommendation about that? :)

edit: In fact the easiest would be to handle highlights the same way you handle links in ParagraphWithLinks, i.e. adding a List<Highlight> highlights to ParagraphWithLinks and filling it when parsing paragraphs...

tgalery commented 7 years ago

Hi @thomasopsomer, I'd agree with your edit; the best option is probably to add it to ParagraphWithLinks. Maybe we can simply add a "type" value in the constructor that could be "italics" or "bold" (or maybe "unknown" if there are other types).
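
To make the proposal concrete, here is a minimal sketch of a highlight that takes its type in the constructor and of a paragraph that carries a list of highlights. All names (`HighlightType`, `Highlight`, `ParagraphSketch`) are hypothetical stand-ins for illustration; `ParagraphSketch` is not the real ParagraphWithLinks class, whose actual fields and constructor are not shown in this thread.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical names throughout; these are not the actual json-wikipedia classes.
enum HighlightType { BOLD, ITALIC, UNKNOWN }

class Highlight {
    private final String text;
    private final HighlightType type;

    Highlight(String text, HighlightType type) {
        this.text = text;
        // Fall back to UNKNOWN so callers never have to handle a null type.
        this.type = (type == null) ? HighlightType.UNKNOWN : type;
    }

    String getText() { return text; }
    HighlightType getType() { return type; }
}

// Stand-in for ParagraphWithLinks, reduced to the parts relevant to highlights.
class ParagraphSketch {
    private final String paragraph;
    private final List<Highlight> highlights = new ArrayList<>();

    ParagraphSketch(String paragraph) { this.paragraph = paragraph; }

    String getParagraph() { return paragraph; }

    // Called by the (assumed) parser each time it meets a bold or italic span.
    void addHighlight(String text, HighlightType type) {
        highlights.add(new Highlight(text, type));
    }

    List<Highlight> getHighlights() {
        return Collections.unmodifiableList(highlights);
    }

    // Filtering by type is the use case from the first comment of this issue.
    List<String> textsOfType(HighlightType type) {
        List<String> out = new ArrayList<>();
        for (Highlight h : highlights) {
            if (h.getType() == type) {
                out.add(h.getText());
            }
        }
        return out;
    }
}
```

With this shape, a consumer that only cares about bold mentions would call `textsOfType(HighlightType.BOLD)` on each paragraph, and the paragraph association comes for free because the highlights live inside the paragraph object.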

tgalery commented 7 years ago

Also, I don't want to create extra problems, but having highlights, especially bold forms, might be interesting for the upstream repo https://github.com/diegoceccarelli/json-wikipedia (maybe @diegoceccarelli wants to say something), which doesn't have ParagraphWithLinks, so maybe having a "highlights" key in the page with a paragraphIndex might be a good compromise.

thomasopsomer commented 7 years ago

Yep, my first thought was not to break the original json-wikipedia too much, and since ParagraphWithLinks was already quite a big change to the original repo, I thought it wasn't a big deal ^^. (Besides, why do you still populate Paragraphs and Links in addition to ParagraphWithLinks?)

Looking at the code from both the upstream repo and this one, it looks like the processing needs to differ in each case because the two repos are not using the same "paragraphs": this one extends the default paragraphs with lists and tables ... in AllParagraphs.

tgalery commented 7 years ago

Cool, if that's the case and you feel like giving it a whack here, go for it and maybe we can backport it later.

diegoceccarelli commented 6 years ago

Hi, I would specialise the field highlights to contain the type of highlight, something like:

"highlights": [ { "type" : "italic", "text": "london"}, { "type" : "bold", "text": "new-york"} ]

In this way we could add new types if needed, and add extra attributes (e.g. paragraph-id, or the position in the article).

@thomasopsomer feel free to submit the patch on the main repo as well. @tgalery is the idio version very different from the main one? Could we merge the differences?
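
As an illustration of this page-level shape, the sketch below builds the "highlights" array with a type, a text and one extra attribute. The class name `PageHighlight`, the `paragraphId` key and the hand-rolled serialization are assumptions made to keep the example dependency-free; a real patch would more likely reuse whatever JSON library the project already relies on.

```java
import java.util.List;

// Hypothetical sketch of one entry of the proposed "highlights" field.
class PageHighlight {
    final String type;      // e.g. "bold" or "italic"; new types can be added later
    final String text;
    final int paragraphId;  // assumed extra attribute linking back to a paragraph

    PageHighlight(String type, String text, int paragraphId) {
        this.type = type;
        this.text = text;
        this.paragraphId = paragraphId;
    }

    // Escapes only backslash and double quote; a JSON library covers the rest.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    String toJson() {
        return "{\"type\":\"" + escape(type) + "\","
             + "\"text\":\"" + escape(text) + "\","
             + "\"paragraphId\":" + paragraphId + "}";
    }

    // Serializes a list of highlights as the value of the "highlights" key.
    static String toJsonArray(List<PageHighlight> highlights) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < highlights.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(highlights.get(i).toJson());
        }
        return sb.append(']').toString();
    }
}
```

Because each entry is an object rather than a bare string, adding another attribute later (such as a position in the article) only adds a key and does not change the shape that existing consumers read.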

tgalery commented 6 years ago

Hi @diegoceccarelli, unfortunately the idio version is quite different from the main one. We had to add fields for the paragraph ids back in 2015 and that changed the schema since idio at the time introduced paragraphsWithLinks instead of simply paragraphs. So the merge would be quite tricky at this point. We are removing some differences from the codebase, but given current resources, things look a bit tricky.

I think that paragraphsIds would be a must for this PR though.

@thomasopsomer if you want, you can fork from the original repo and submit a PR there, then I can create a branch from your branch and try to incorporate things here. You might have to branch from a more common point in history if you wanna minimize effort, but I'd be happy to port things too.

thomasopsomer commented 6 years ago

@diegoceccarelli thanks for your input :) I had introduced the change as a new attribute of paragraphsWithLinks, but I'll go back to a highlight object for the highlights attribute so that it's easier to share between the two repos.

@tgalery yes paragraphsIds is clearly a must and perhaps character offsets too.
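
To show what a paragraph id plus character offsets could look like on each highlight, here is a rough sketch that strips `'''bold'''` and `''italic''` wiki markup from one paragraph and records where each span lands in the resulting plain text. It is a regex-based illustration only, not how json-wikipedia's ArticleParser extracts highlights, and every name in it is hypothetical. It ignores nested or combined markup such as `'''''bold italic'''''`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical highlight carrying the paragraph id and offsets discussed above.
class OffsetHighlight {
    final String type;      // "bold" or "italic"
    final String text;
    final int paragraphId;
    final int start;        // inclusive offset in the plain paragraph text
    final int end;          // exclusive offset in the plain paragraph text

    OffsetHighlight(String type, String text, int paragraphId, int start, int end) {
        this.type = type;
        this.text = text;
        this.paragraphId = paragraphId;
        this.start = start;
        this.end = end;
    }
}

// Plain text of one paragraph together with the highlights found in it.
class ParsedParagraph {
    final String plainText;
    final List<OffsetHighlight> highlights;

    ParsedParagraph(String plainText, List<OffsetHighlight> highlights) {
        this.plainText = plainText;
        this.highlights = highlights;
    }
}

class OffsetHighlightExtractor {
    // Order matters: try the three-quote bold form before the two-quote italic form.
    private static final Pattern MARKUP = Pattern.compile("'''(.+?)'''|''(.+?)''");

    static ParsedParagraph parse(int paragraphId, String wikitext) {
        Matcher m = MARKUP.matcher(wikitext);
        StringBuilder plain = new StringBuilder();
        List<OffsetHighlight> found = new ArrayList<>();
        int cursor = 0;
        while (m.find()) {
            // Copy the unmarked text that precedes this span.
            plain.append(wikitext, cursor, m.start());
            boolean bold = m.group(1) != null;
            String inner = bold ? m.group(1) : m.group(2);
            int start = plain.length();
            plain.append(inner);
            found.add(new OffsetHighlight(bold ? "bold" : "italic", inner,
                    paragraphId, start, plain.length()));
            cursor = m.end();
        }
        // Copy whatever follows the last span.
        plain.append(wikitext, cursor, wikitext.length());
        return new ParsedParagraph(plain.toString(), found);
    }
}
```

Storing offsets relative to the plain paragraph text, rather than the raw wikitext, means `plainText.substring(start, end)` always returns the highlighted text, which is the property a downstream consumer would most likely rely on.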