Closed thomasopsomer closed 5 years ago
Hi @thomasopsomer, yes, we are definitely interested in that. Some of the dependencies of json-wikipedia, such as dkpro, bliki, or sweble, may already have ways to do so. Let me know if you need help getting this started.
Actually, the setHighlights method of the ArticleParser shows how to get italic and bold mentions (https://github.com/idio/json-wikipedia/blob/development/src/main/java/it/cnr/isti/hpc/wikipedia/parser/ArticleParser.java#L411).
I'm wondering whether it's better to create two fields, BoldHighlights and ItalicHighlights, or to make highlights a list of custom objects with a type field indicating whether each one is bold or italic. I would also like to keep the information about which paragraph each highlight comes from, so either a list of lists or a paragraphId field. Do you have any recommendation about that? :)
edit:
In fact, the easiest would be to handle highlights the same way you handle links in ParagraphWithLinks, i.e. adding a List<Highlight> highlights to ParagraphWithLinks and filling it when parsing paragraphs...
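A minimal sketch of what that could look like, assuming a simplified paragraph class. ParagraphSketch, its nested Highlight class, and addHighlight are hypothetical names used only for illustration; the real ParagraphWithLinks has its own fields and constructor, which are not reproduced here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for ParagraphWithLinks; not the actual json-wikipedia class.
final class ParagraphSketch {

    // Hypothetical holder for one highlighted span (only its surface text here).
    static final class Highlight {
        final String text;

        Highlight(String text) {
            this.text = text;
        }
    }

    private final String text;
    private final List<Highlight> highlights = new ArrayList<>();

    ParagraphSketch(String text) {
        this.text = text;
    }

    // Intended to be called by the parser while it walks this paragraph,
    // mirroring the way links are collected per paragraph.
    void addHighlight(String highlightedText) {
        highlights.add(new Highlight(highlightedText));
    }

    String getText() {
        return text;
    }

    // Read-only view, so callers cannot modify the paragraph's highlights.
    List<Highlight> getHighlights() {
        return Collections.unmodifiableList(highlights);
    }
}
```

Keeping the list on the paragraph itself means the paragraph of origin is implicit, so no separate paragraphId is needed in this variant.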
Hi @thomasopsomer, I agree with your edit; the best option is probably to add it to ParagraphWithLinks. Maybe we can simply add a "type" value in the constructor that could be "italics" or "bold" (or maybe "unknown" if there are other types).
Also, I don't want to create extra problems, but having highlights, especially bold forms, might be interesting for the upstream repo https://github.com/diegoceccarelli/json-wikipedia (maybe @diegoceccarelli wants to say something), which doesn't have ParagraphWithLinks, so having a "highlights" key in the page with a paragraphIndex might be a good compromise.
Yep, my first thought was not to break the original json-wikipedia too much, and since ParagraphWithLinks was already quite a big change to the original repo, I figured it was not a big deal ^^. (Besides, why do you still populate Paragraphs and Links in addition to ParagraphWithLinks?)
Looking at the code of both the upstream repo and this one, it looks like the processing needs to differ in each case because the two repos do not use the same "paragraphs": this one extends the default paragraphs with lists and tables ... in AllParagraphs.
Cool, if that's the case and you feel like giving it a try here, go for it, and maybe we can backport it later.
Hi, I would specialise the field highlights to contain the type of highlight, something like:
"highlights": [ { "type" : "italic", "text": "london"}, { "type" : "bold", "text": "new-york"} ]
This way we could add new types if needed, and add extra attributes (e.g. paragraph-id, or the position in the article).
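On the Java side, a typed highlight along these lines could be sketched as below. This is only an illustration under assumptions: the class name Highlight, the Type enum, the paragraphIndex attribute, and the filterByType helper are hypothetical and not taken from either repository.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names are illustrative, not part of json-wikipedia.
final class Highlight {

    // UNKNOWN leaves room for markup that is neither bold nor italic.
    enum Type { BOLD, ITALIC, UNKNOWN }

    final Type type;
    final String text;
    final int paragraphIndex; // assumed extra attribute: paragraph the highlight came from

    Highlight(Type type, String text, int paragraphIndex) {
        this.type = type;
        this.text = text;
        this.paragraphIndex = paragraphIndex;
    }

    // Keep only the highlights of one type, e.g. only the bold ones.
    static List<Highlight> filterByType(List<Highlight> all, Type wanted) {
        List<Highlight> kept = new ArrayList<>();
        for (Highlight h : all) {
            if (h.type == wanted) {
                kept.add(h);
            }
        }
        return kept;
    }
}
```

With a single list of typed objects, filtering by bold or italic stays a one-line call, and further attributes such as character offsets could be added as extra fields without changing the overall shape.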
@thomasopsomer feel free to submit the patch to the main repo as well. @tgalery is the idio version very different from the main one? Could we merge the differences?
Hi @diegoceccarelli, unfortunately the idio version is quite different from the main one. We had to add fields for the paragraph ids back in 2015, and that changed the schema, since idio at the time introduced paragraphsWithLinks instead of simply paragraphs. So the merge would be quite tricky at this point. We are removing some differences from the codebase, but given current resources, it will not be easy.
I think that paragraphsIds would be a must for this PR though.
@thomasopsomer if you want, you can fork from the original repo and submit a PR there; then I can create a branch from your branch and try to incorporate things here. You might have to branch from a more common point in history if you want to minimize effort, but I'd be happy to port things too.
@diegoceccarelli thanks for your input :) I had introduced the change as a new attribute of paragraphsWithLinks, but I'll go back to a highlight object for the highlights attribute so that it's easier to share between the two repos.
@tgalery yes, paragraphsIds is clearly a must, and perhaps character offsets too.
Hi,
I want to be able to filter bold or italic among the highlights extracted by json-wikipedia. I'm planning to address that; is this something you're also interested in?