gsautter / goldengate-imagine

Automatically exported from code.google.com/p/goldengate-imagine

[Template] Number of headings #840

Open mguidoti opened 4 years ago

mguidoti commented 4 years ago

Hi Guido,

A couple of questions regarding headings, for whenever you have time.

  1. "Abstract/Résumé" headings in this specific journal we're working on are in bold, while other level 1 headings like "Introduction" and "Material & Methods" have different font properties. We're considering adding them as heading 1 and 2, respectively, although we technically believe they're at the same level. What do you think?

  2. Also, we know that we can tell GGI that we have more than 3 levels of headings (there is a parameter for that), but once we set this parameter to, let's say, 4, a fourth heading option doesn't show up, so we can't define its parameters. Is this the expected behavior?

Essentially the problem is that we identify up to 5 levels of headings but we can only set up three in the GGI:

How do you suggest we proceed?

Here's the link to a PDF that contains all of these headings, if you want to take a look: https://drive.google.com/open?id=13B_LQN-BSycFlK911VQ_m1gFZvYMq5L-

gsautter commented 4 years ago

Regarding (2): This is not the expected behavior, but something I need to fix in some kind of way ... didn't have the need thus far, though. I'll add the parameters for a fourth level of headings in the next update.

mguidoti commented 4 years ago

Thanks Guido.

If possible, maybe add for a fifth level, considering that we found the need already?

gsautter commented 4 years ago

OK, you got the fifth.

gsautter commented 4 years ago

Regarding (1): This sounds complicated, as the one thing the whole template approach assumes is coherence in terms of styling ... I'll definitely need to look at that example PDF to say more, and for that, you'll need to grant me access.

mguidoti commented 4 years ago

I thought you already had access, sorry.

Now you have.

And thanks!

gsautter commented 4 years ago

Looking at the example, things don't look all that bad ... I suggest the following:

The only real headache is "Acknowledgements", which technically needs to be at the same level as "INTRODUCTION", "REFERENCES", etc. to produce the correct document structure ... might be we need to resort to a combination of start patterns and font size for Level 1, which should be capable of modeling what we need.

mguidoti commented 4 years ago

OK, we'll give it a shot.

Thanks, Guido.

mguidoti commented 4 years ago

Just to be clear: are you going to add the other two levels, or did you conclude that we should try to use start patterns and fit all these headings into the three available levels?

mguidoti commented 4 years ago

If you want to look at another PDF example, this one has subheadings under Material & Methods, which are different from the subheadings under a given treatment.

https://drive.google.com/open?id=1Kz43Z988s2PKVADOg1AQQUL2M5PLhuQL

gsautter commented 4 years ago

I did add the parameters for Level 4 and Level 5 headings, and they will come with the next update like I said. It's simply that in this example I think 3 levels of headings should be sufficient (also in the second PDF) - the only intent behind the suggestion to use start patterns was to get "Acknowledgements" on the same level as "SYSTEMATICS" and "REFERENCES". The latter is important so that (a) "Acknowledgements" ends the treatment right before it and (b) "Acknowledgements" doesn't get dragged into "DISCUSSION" in the second PDF.

Background: A section started with a Level 1 heading runs up to the next Level 1 heading; a section under a Level 2 heading runs up to the next heading of Level 2 or 1; a section under a Level 3 heading runs up to the next heading of Level 3, 2, or 1, etc. That said, getting the heading levels right is important for the document being correctly structured.
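The scoping rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated, not GGI's actual code; the `Section` class, the `section_spans` function, the line indices, and the "Genus Aus" heading are all made up for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Section:
    title: str
    level: int
    start: int  # index of the heading line
    end: int    # index one past the last line of the section


def section_spans(headings: List[Tuple[int, int, str]], line_count: int) -> List[Section]:
    """headings: (line_index, level, title) tuples in document order."""
    sections = []
    for i, (start, level, title) in enumerate(headings):
        end = line_count  # by default a section runs to the end of the document
        for next_start, next_level, _ in headings[i + 1:]:
            if next_level <= level:  # a heading of the same or a higher rank closes it
                end = next_start
                break
        sections.append(Section(title, level, start, end))
    return sections


# Hypothetical heading list: (line index, level, title).
headings = [
    (0, 1, "SYSTEMATICS"),
    (5, 2, "Genus Aus"),
    (8, 3, "Description"),
    (20, 1, "Acknowledgements"),
    (25, 1, "REFERENCES"),
]
spans = section_spans(headings, 40)
for s in spans:
    print(s.level, s.title, s.start, s.end)
```

With "Acknowledgements" at Level 1, the treatment under "Genus Aus" ends at line 20. If "Acknowledgements" were tagged as Level 3 instead, the same rule would let the "Genus Aus" section run on to line 25 and swallow it.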

mguidoti commented 4 years ago

Got it. Thanks a lot, Guido.

What happens if, let's say, our heading 2 parameters catch the level 2 headings as they're supposed to, but also catch the level 1 headings (which are also being caught by the heading 1 parameters)?

Not sure if I was clear; if not, sorry, I'll try to illustrate the question with examples.

gsautter commented 4 years ago

No need to illustrate ... in brief, the scenario you describe would not be good. If that is due to font sizes (which I assume it is, in particular "Acknowledgements" being too small), you might want to try and find a way of making this unambiguous ... need to think about how exactly we might achieve this, though.

mguidoti commented 4 years ago

In this particular case, besides the similar font size, one is all-caps and the other is small-caps, and we can't tell them apart by parameters (small-caps are caught if we set all-caps, for instance).

I've no idea how hard it would be to add small-caps as a parameter throughout the template creator, but that might be a good idea, in my humble opinion.

gsautter commented 4 years ago

There is an "All Caps" flag already, and since GGI joins words that semantically are one, there is no difference between all-caps and small-caps from the point of view of the heading extraction code. That conflation of meanings is outright necessary considering the fact that small-caps are mostly rendered as all-caps with a reduced font size.

gsautter commented 4 years ago

As I stated above, I need to think about this one.

mguidoti commented 4 years ago

Sure thing, and thanks a lot for this support!

If you need anything from us about this matter, just ping me.

myrmoteras commented 4 years ago

@gsautter @mguidoti please make a file with changes to suggest to the MNHN journal editors for the layout. They are open to this. When Guido is in Paris, we can briefly bring it up on the sidelines of the meeting with Métope.

mguidoti commented 4 years ago

Here, @myrmoteras https://docs.google.com/document/d/10cogklYkDUfVXd9_PwEbR5mypx_jxNsKHr_y2gIWd6M/edit?usp=sharing.

Not sure where else I could place this link?

myrmoteras commented 4 years ago

@gsautter @mguidoti please also look at this from the downstream perspective of producing TaxPub. I wonder whether this is possible with so many heading levels that in fact often should be the same; for example, "abstract" should have the same level as "acknowledgment" or "introduction".

mguidoti commented 4 years ago

If creating many levels might be a structural problem - and I do believe it might be - perhaps we should consider the possibility of creating more than one condition to define each heading level in GGI?

Say, three different 'tabs' for heading 1 so we can define ABSTRACT (all-caps, bold), INTRODUCTION (all-caps) and Acknowledgment (bold) as the same level.

Would that be possible, @gsautter ?

What do you think?

gsautter commented 4 years ago

@mguidoti that's surely a way of looking at it ... however, the option to have multiple definitions for a single heading level at the same time is something that will take some remodeling of the respective template parameter groups and is not as easy a thing to implement. I'll need to do some thinking about how to realize this to make sure we don't end up creating a mess ...

myrmoteras commented 4 years ago

Please, before you even start thinking about this, discuss the levels with @tcatapano. The goal is NOT to be able to recreate the exact format of the original, but to capture what corresponds to the semantics of the article, at the granularity at which we mark it up.

So, before starting investing into this, get Terry involved.

mguidoti commented 4 years ago

Ok, does Terry have access to this repo? If so, we can ping him here.

But before doing that, I'll illustrate the challenges we're facing with the MNHN journals (Zoosystema, Geodiversitas, Anthropozoologica [has no treatments], Adansonia) and the current set of parameters. Besides illustrating the issue with a real-world problem, we might also benefit from Guido's tips. Ok?

I used patterns to get all level 1 headings, and the recipe was working just fine (figure below). However, after re-loading the document and running the detect document structure tool, some of the headings that were supposed to be marked weren't. [image]

Note that the minimum font size was necessary to get Abstract and Résumé as headings.

Still on heading 1, look at this case from Anthropozoologica. The string has the same font size and properties, but it didn't work.

[image] [image]

We're also having trouble with heading 3 because of the small-caps, which have the initial letter at font size 11 and the following ones at 9. This means we have to use the min and max font size attributes, which also catch what is supposed to be our heading level 1. [image]

We consider everything else to be in place - we're just the headings away from running the first tests with these templates.

Any guidance will be helpful, Guido.

Thanks a lot.

gsautter commented 4 years ago

I surely understand that recreating the exact format is not the goal. And it's not the question, either ... the question is how to deal with a case of headings that are at the same logical level and yet use different formats, aka how to model multiple heading styles for one level in the templates so GGI can recognize them ... in brief, this is about recognizing heading formats, not about recreating them.

mguidoti commented 4 years ago

Insecta Mundi has a different but equally interesting challenge: first-level headings sharing a line with normal, non-heading text (and with different style settings from the remaining first-level headings, like Introduction and Results).

[image]

gsautter commented 4 years ago

I don't think you need to worry about "Abstract" and "Key words" here ... while it's nice to have them in their own respective subSections, we don't lose all too much (in our current process, nothing) if they are conflated with the document_head subSection.

mguidoti commented 4 years ago

Right.

Thanks, Guido.

mguidoti commented 4 years ago

So, Guido, check this out.

Even considering a system with three levels of headings such as:

  1. h1 = major document structural sections like introduction and references
  2. h2 = treatment 'title'
  3. h3 = treatment subheadings/subSubSections

And even ignoring different stylistic settings for h1 subSections like "Abstracts", "Key words" and "Acknowledgement" because they're less important, we're still facing severe issues because:

  1. Journals that have H2 and H1 with such stylistic settings that the current parameters can't tell them apart (e.g., small caps that mess up everything);
  2. Journals that have different heading levels with EXACT same stylistic settings.

With this in mind, I'm interested in understanding how the heading markup actually works. Say we have the style all set for H1, but the H2 settings catch both what is supposed to be H1 and what is supposed to be H2. How will GGI mark these? Will it first mark H1s, and then, when it runs the H2 settings, only mark the ones that weren't marked yet?

How do you think we should proceed?

I know you're looking at this issue, but I felt like this is 'new' to you as well, and could be useful.

Thanks

gsautter commented 4 years ago

Those three heading levels sound about right, yes.

Regarding the two issues: (1) We might consider something about font sizes here, as small-caps tend to be emulated as all-caps at a smaller font size. Examples would be helpful for getting a better grasp of what we're up against. (2) This looks like a real problem ... how do you even tell apart the heading levels if they look exactly the same?

Regarding detection of the heading levels, your assumption is correct: The gizmo uses the top-most matching style. That said, you might be able to tell two heading levels apart even if the layout is very similar by using a fixed list of values (molded into a pattern) for the higher level.
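To make the "top-most matching style" rule and the fixed-list pattern concrete, here is a minimal Python sketch. It is not GGI's implementation: the `Line` class, the style functions, the font size range, and the list of Level 1 values are assumptions chosen for illustration.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Line:
    text: str
    font_size: float
    bold: bool


# Hypothetical fixed list of Level 1 values, molded into a single pattern.
LEVEL1_VALUES = ["INTRODUCTION", "SYSTEMATICS", "DISCUSSION", "Acknowledgements", "REFERENCES"]
LEVEL1_PATTERN = re.compile("|".join(re.escape(value) for value in LEVEL1_VALUES))


def is_level1(line: Line) -> bool:
    # Only an exact match against the fixed list counts as Level 1.
    return LEVEL1_PATTERN.fullmatch(line.text.strip()) is not None


def is_level2(line: Line) -> bool:
    # Deliberately loose: on its own this would also accept Level 1 headings.
    return line.bold and 9 <= line.font_size <= 11


# Styles are checked from the highest level down; the first match wins.
STYLES: List[Tuple[int, Callable[[Line], bool]]] = [(1, is_level1), (2, is_level2)]


def heading_level(line: Line) -> Optional[int]:
    for level, matches in STYLES:
        if matches(line):
            return level
    return None
```

In this sketch `Line("REFERENCES", 11, True)` satisfies both styles, but is assigned Level 1 because that style is checked first; a bold line that is not in the list falls through to Level 2.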

mguidoti commented 4 years ago

(1) The problem with small caps in the cases we faced, like in the four MNHN journals from last week, is that the first letter was at the exact same font size as the all-caps heading 1, while the other letters were at a smaller size - meaning that we had to use the min and max font sizes, and not the fixed font size parameter. Because of that, h1s were caught too.

(2) By logic. Take the treatment title, which taxonomists like us really can't confuse with anything else, and then 'description', 'material examined' and other things that clearly belong to the treatment.

We're going with this solution (fixed value patterns for h1), but of course journals vary those things eventually as well...
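The overlap described in (1) can be reproduced with a small Python sketch. The font sizes 11 and 9 are the ones reported earlier in this thread; the function names and the "mixed sizes" heuristic are hypothetical and are not something GGI is known to implement.

```python
from typing import List


def in_size_range(char_sizes: List[float], min_size: float, max_size: float) -> bool:
    """True if every character's font size lies within [min_size, max_size]."""
    return all(min_size <= size <= max_size for size in char_sizes)


def looks_like_small_caps(char_sizes: List[float]) -> bool:
    """Hypothetical heuristic: the heading mixes more than one font size."""
    return max(char_sizes) > min(char_sizes)


# All-caps heading 1: every letter at size 11.
all_caps_h1 = [11.0] * len("INTRODUCTION")
# Small-caps heading 3: first letter at 11, remaining letters at 9.
small_caps_h3 = [11.0] + [9.0] * (len("DESCRIPTION") - 1)

# A min/max range wide enough for the small-caps heading also accepts the h1.
print(in_size_range(small_caps_h3, 9, 11), in_size_range(all_caps_h1, 9, 11))
# Looking at the spread of sizes separates the two.
print(looks_like_small_caps(small_caps_h3), looks_like_small_caps(all_caps_h1))
```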

gsautter commented 4 years ago

(1) The handling of font sizes is by no means engraved in stone, and neither is the set of available parameters, so as small-caps appear to become an issue, I might well amend the logic to work with small-caps as well ...

(2) Fixed lists of headings at specific levels most likely are the way to go if a human reader's logic is the only way of telling them apart, I'm afraid. Of course, there will be some variation, so the lists might require amending over time ... but then, hey, constantly amending things to handle new cases or challenges has been my life for years ... just the way it is, I'm afraid ... the data never runs out of surprises.

mguidoti commented 4 years ago

(1) I really believe small caps are often used and we'll be frequently exposed to similar situations, unfortunately... Sorry ;(

(2) That's a hell of a quote, Guido: "the data never runs out of surprises". Haha.

gsautter commented 4 years ago

(1) Nothing to be sorry about, see my comment on (2) ... let's collect a good bunch of example IMFs, and then I'll get to work on that.

(2) Experience, nothing but experience ... no matter how perfectly something handles a "representative" test set of documents, soon as you hand over whatever you built to the users, the "doesn't work" tickets start coming.

mguidoti commented 4 years ago

Ok.

I've created a shared folder where we'll start to save different IMFs with special cases, so you can take a look and explore.

Also, there is this newly created Google Spreadsheet where we'll record the file name and a short description of the heading-related challenge each file presents.

I hope this will be helpful. The team here has already been instructed to fill these in from now on.

Thanks, Guido.

flsimoes commented 4 years ago

Sorry about the duplicate Guido, I'm really not trying to swamp you. I didn't realize that I should follow the nested discussion in this ticket, as I saw Zoosystema as an individual situation (the other guys have already solved the headings for the other MNHN journals). My bad!

gsautter commented 4 years ago

As I was reading this ticket, the discussion included Zoosystema ... if it doesn't, then the bad is all mine about #869 ... I'm aware the headings are a bit of a problem, and a solution is in the works, most likely in the form of allowing multiple templates per heading level to accommodate filters that don't combine without softening up beyond working properly.

But yes, I'm afraid the level 2 headings are indeed a different issue, if one that should also resolve with this multi-template approach: Either there is a bold taxon name in the lead, or the lead word is "Genus", "Family", etc., which would model via patterns in a second heading style on the same level. The paragraph start lines should filter out via the alignment when you actually run the template, but if they don't, we'll think of something ... maybe that lines have to be shorter than block width ("strictly centered", which would get us rid of the justified lines), or that there needs to be a minimum amount of space above them.
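The "strictly centered" idea, where a line must be centered and also shorter than the block, could look roughly like the following Python sketch. The coordinates, the tolerance, and the minimum margin are assumptions for illustration; how GGI measures alignment internally is not stated in this thread.

```python
def is_strictly_centered(
    line_left: float,
    line_right: float,
    block_left: float,
    block_right: float,
    tolerance: float = 2.0,   # assumed allowance for uneven margins
    min_margin: float = 5.0,  # assumed minimum free space on each side
) -> bool:
    left_margin = line_left - block_left
    right_margin = block_right - line_right
    # Centered: the free space on both sides is about the same.
    centered = abs(left_margin - right_margin) <= tolerance
    # Strictly: the line is visibly shorter than the block on both sides,
    # which rules out justified lines that span the full block width.
    shorter = left_margin >= min_margin and right_margin >= min_margin
    return centered and shorter
```

A justified paragraph line that fills the whole block has equal (zero) margins, so a plain centering test would accept it; the extra margin requirement is what filters it out.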

flsimoes commented 4 years ago

Thanks Guido. I'll keep testing the parameters and will try to see if I can also think of something else to help. I've noticed that the publications in this journal vary the positioning of these headings quite a bit (sometimes they are separated in blocks, sometimes it is one line for each of them... basically no real pattern).

gsautter commented 4 years ago

We'll need to find some kind of pattern ... that's what the templates are all about ...

flsimoes commented 4 years ago

Indeed! The main patterns are the ones you already mentioned ("Genus" and centered, etc).

gsautter commented 4 years ago

Just put a new build online, comprising the ability to use multiple styles per heading level / assign headings matching different styles to the same heading level, which are obviously two sides of the same coin ... let me know if you can work with that.

myrmoteras commented 4 years ago

super!

flsimoes commented 4 years ago

Thanks a lot for all the changes Guido! We have just started testing them out and we'll provide you with the feedback soon, but from the onset they already seem to be working!

On a sidenote (not sure if I need to open a new issue for this), I realized that upon updating all computers, the templates we created here were deleted from the DocumentStyleProviderData folder; this wasn't a big issue, as we had backups, but I felt I had to point that out anyway. I am not sure if this behavior is to be expected, but my guess would be that GGI only saves the templates stored in the main server, which actually would make sense to me in terms of keeping the installation uniform between computers.

mguidoti commented 4 years ago

Outstanding news, @gsautter! Thanks a LOT. We're all busy testing all of these features right now. They came at a good time as well, as we're about to run our first batch-processing tests. Now we can tweak the templates a bit further.

gsautter commented 4 years ago

@flsimoes GGI surely doesn't delete any templates, it just leaves them behind when updating the configuration ... there must be a folder <GGI>/Configurations/Default.imagine.<someNumbers>.old/Plugins/DocumentStyleProviderData where they still are. They are just not copied over to the newly downloaded Default.imagine, so no worries.

That aside, I'm well aware we need a centralized solution for storing and sharing the templates, already thinking up possible implementations.
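Until such a centralized solution exists, copying the templates back by hand could be scripted along these lines. This is a sketch under assumptions: it takes the folder layout from the comment above, assumes the new configuration keeps its templates under Default.imagine/Plugins/DocumentStyleProviderData, and assumes templates are plain files. The function name `restore_templates` is made up.

```python
import shutil
from pathlib import Path
from typing import List


def restore_templates(ggi_root: Path) -> List[Path]:
    """Copy templates left behind in Default.imagine.*.old folders into Default.imagine."""
    configs = ggi_root / "Configurations"
    # Assumed location of the templates in the freshly downloaded configuration.
    target = configs / "Default.imagine" / "Plugins" / "DocumentStyleProviderData"
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for old_dir in sorted(configs.glob("Default.imagine.*.old")):
        source = old_dir / "Plugins" / "DocumentStyleProviderData"
        if not source.is_dir():
            continue
        for item in sorted(source.iterdir()):
            destination = target / item.name
            # Never overwrite a template that already exists in the new configuration.
            if item.is_file() and not destination.exists():
                shutil.copy2(item, destination)
                copied.append(destination)
    return copied


# Usage (the path is a placeholder for the local GGI installation folder):
# restore_templates(Path("/path/to/GGI"))
```

If several `.old` folders hold a template with the same file name, the first folder in sorted order wins, so it is worth checking by hand which copy is the most recent.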

flsimoes commented 4 years ago

@gsautter Thanks! Yes, they are indeed there!

By the way, since we are on the topic of headings, I've had a thought on how to deal with "headings 3" (things like "Remarks", "Description", etc). Please correct me if I'm wrong at any point.

What if we added a Value Pattern in addition to, or instead of, a Start Pattern? My rationale here is this: quite often, journals put these headings in the same paragraph (and line) as the start of their related text; since Start Patterns won't help you tell where the matching should stop, GGI will end up matching the whole line, correct? However, if we could input Value Patterns, we would be able to tell the software to stop matching at a separator (most often a full-stop after one or two words, or even an em-dash). Example: ([A-Z][a-z]+\s*(?!\—))

Apologies in advance in case the explanation was overly complicated.
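For illustration, the value pattern idea can be tried out in Python. The pattern below is a variant written for this sketch, not the one quoted above and not a GGI parameter: it captures one capitalized word, optionally followed by a second lowercase word, and stops at a full stop or an em-dash (written as the escape `\u2014`).

```python
import re
from typing import Optional

# Hypothetical value pattern: heading text followed by a separator.
INLINE_HEADING = re.compile(r"^([A-Z][a-z]+(?: [a-z]+)?)\s*[.\u2014]")


def inline_heading(line: str) -> Optional[str]:
    """Return the in-line heading at the start of a line, or None if there is none."""
    match = INLINE_HEADING.match(line)
    return match.group(1) if match else None


print(inline_heading("Remarks. This species differs from its congeners."))
print(inline_heading("Material examined \u2014 Holotype, one specimen."))
print(inline_heading("The specimens were collected in the field."))
```

Only the captured group is treated as the heading, so the rest of the line stays ordinary text. A very short opening sentence would also match, so a pattern like this would still need a vocabulary or emphasis check on top.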

gsautter commented 4 years ago

What you describe are in-line headings ... no need to mark those via the templates. The gizmo marking treatment structure (the subSubSections) has means in its rule set that match on exactly these emphasized line starts coming from a pretty much controlled vocabulary. On top of that, it has a routine dynamically tagging emphasized paragraph starts that repeat across treatments, so this is taken care of further downstream.

flsimoes commented 4 years ago

Perfect, thanks for the explanation!

mguidoti commented 4 years ago

Good to know!

Thanks!
