Peek for Brackets - Githubissues

Measter commented 10 years ago

I hope this is the right place for feature requests, sorry if it's not.

Would it be possible to add a function to see if the next read data is inside brackets without returning that data?

For example, we have this file:

tag1 = val1
tag2 = {
    tag3 = val2
}

But if we don't know what the tags are and are storing them in a generic structure, as far as I can tell, there's no way to predict that tag2 contains bracketed data without reading the data within the brackets. If you do ReadString(), it will increase the indent, but now you've read the data and need to now handle two tags at once making it more complicated.

It would be much simpler to be able to do something like this:

Action<ParadoxParser, string> getTagData = ( p,s ) => {
    Tag tagData = new Tag();
    tagData.ID = s;

    if ( p.NextIsInGroup() )
        HandleGroupTag( tagData, p, s );
    else
        tagData.Value = s.ReadString();
};

nickbabcock commented 10 years ago

This is the perfect place for feature requests. Thank you for reaching out.

I've wanted this feature too at certain points, but every time I've gone to try and implement it, I realize that I actually know the structure and I specify it with a IParadoxRead class or as an Action as you have shown. If you don't know the structure then there are ambiguous tags that can't be resolved. For instance, let's say have a file with the following line:

core=THM

One doesn't actually know if core is a string or a list as next example shows a common scenario

core=THM
[...]
core=HOI

I'll make a writeup and add it to the readme that goes more in depth about some of these issues.

Can you show a use case that you have that warrants this feature? I've been hacking on this project a lot lately trying to get it into really good shape, so any help or suggestions are much appreciated.

Measter commented 10 years ago

What I would like to use it for is parsing the gain_effects and allows segments of title definitions in Crusader Kings II. Title names are arbitrary past the first two characters, so the template system can't be used.

The gain_effects and allows segments of the titles are basically the full scripting system, so it can be very complex. For example, this is from e_latin_empire

    allow = {
        OR = {
            religion = catholic
            is_heresy_of = catholic
        }
        e_byzantium = {
            has_holder = no
        }
        e_roman_empire = {
            has_holder = no
        }
        custom_tooltip = {
            text = controls_kaliopolis_thrake
            hidden_tooltip = {
                495 = { # Kaliopolis
                    owner = {
                        OR = {
                            is_liege_or_above = ROOT
                            character = ROOT
                        }
                    }
                }
                497 = { # Thrake
                    owner = {
                        OR = {
                            is_liege_or_above = ROOT
                            character = ROOT
                        }
                    }
                }
            }
        }
    }

Something like that would be much simpler to handle with a generic parser which parses the data into generic Tag objects containing the tag id and data assigned to it, whether it contains sub-tags or single data.

Measter commented 10 years ago

I've implemented a bracket "peek" function and test in my fork of the project. I think I've thought of the possible issues it could have, but could you have a look over it?

nickbabcock commented 10 years ago

EDIT: I see you have commented while I was writing this comment. I'll take a look, but please still read the rest of my comment and let me know your thoughts :smile:

So I've decided to sleep on what you said and I have a couple of opinions. My first thought is that I think you can do it with the current parser. Do note that I have never played CK2, so what I may descibe may be erroneaous. If so, let me know. The OR tags seem pretty simple, just strings matching to strings. The trickiest part might be e_byzantium and e_roman_empire, but since you are parsing a game file, I'll assume you can parse the file that defines all the countries. Then you could test if the token returned was a country. So off the top of my head I would write something like:


public class AllowCollection : IEnumerable<object>, IParadoxRead
{
    private List<object> objs;

    // This could be static if it applies
    private readonly HashSet<string> countries;    

    public AllowCollection(HashSet<string> countries)
    {
        objs = new List<object>();
        this.countries = countries;        
    }

    public void TokenCallback(ParadoxParser parser, string token) 
    {
        switch (token)
        {
        case "OR": objs.Add(parser.Parse(new OrTag()); break;
        case "custom_tooltip": objs.Add(parser.Parse(new CustomTooltip())); break;
        default:
            if (countries.Contains(token))
                objs.Add(parser.Parse(new CountryTag(token))); break;
            // else we have no idea ...
        }
    }

    // Implement IEnumerable stuff ...

}

I obviously don't know your needs, so you may need to change it to how you see fit. The benefits with this approach is that after parsing the text you don't need to disect the structure as much. Sure this may require more upfront work, but I think it will pay off in the long run. I say this because nothing is infinite, there are always a limited number of choices.

However, despite everything I 've just said, I've been thinking and it should be possible to support some kind of generic parsing. A paradox file could be defined as a dictionary<string, object> where the object is either a string, list or a sub dictionary.

Some implementation notes (skip over if you are not interested):

The object can't be a plain string as it must record data on whether quotes were encountered, so that it can be written out correctly. Some sort of ParadoxString class would have to be created to store this information along with the original string. The same thing would need to be done with lists, to store whether the elements were consecutive.
If a string encountered is already in the dictionary, then we are dealing with a list (with nonconsecutive elements). We'd have to take the object out of the dictionary and test if it was a list, and if so, add the new element to the back of the list. Else if the object isn't a list, take the old object and create a list with that as the first element and append the second element.
When reading inside brackets, you don't know if you are areading some kind of consecutive elements list or an object until you get to the first token and see what the next token is. An '+' would denote we are in an object, while another untyped token would mean we are in a list. THis may involve backtracking, which I'm not a fan of at this moment.
I'm unsure if the current parse has the capabilities. Definitely the existing functionalities can't be leveraged like ReadList.
I'm planning on dropping the dictionary interface (#19), so it would probably be something like IList<Tuple<string, object>> or IList<KeyValuePair<string, object>>

How does that sound? If you'd like to implement it, go right ahead, else I may take a crack at it this fourth of July weekend. I will admit that I don't see an urgent need for this functionality as it will be slower and more memory intensive than the traditional way. And at the end of it all, you're still going to need to "parse" the structure once you're done anyways.

Measter commented 10 years ago

The title IDs are not required to be defined before they are used, so your solution for reading them wouldn't work. They do have a predictable pattern, though.

In terms of tag predictability, the game scripting is pretty consistent in use but has a lot of possible tags. It would be possible to specifically support all the tags, but you would have to explicitly check for all the scope tags, which is possible but unrealistic.

In my implementation of the "bracket peek", I extended how the PeekToken() function works by replacing the nextToken? with a Queue<LexerToken>, allowing it to peek ahead multiple tokens and still cache them.

In the unit test I wrote, I gave two id tags, one with brackets, one without, and showed (with a contrived example) how it such a check could be useful.

nickbabcock commented 10 years ago

I took a look at your code, it's clean and I get the idea. I like the test case, but I wasn't overly convinced that other parts of the program went unaffected because so many functions rely on GetNextToken, but if all the tests passed then that's great!

I'd appreciate if you use your new feature and try to implement something not contrived (no need to make it a unit test, just use it on your ck2 project), and let me know how it works :wink:

If it works well, then I'll merge your changes, else I'll give you more time to more fully develop the feature. How does that sound?

Measter commented 10 years ago

I've implemented an example of a partially generic reader using my changes, and uploaded it here along with generic storage classes. It's partial because at this time, there's no way to predict if the bracketed data contains tag = val type data or a plain list of values.

The repo also comes with an example title from CK2 showing how, with a simple addition to handle the color and color2 tags, the generic parser can handle the entire title without specifically handling most tags.

Another issue with doing generic parsing is that you also can't take advantage of the parser's ability to read other types of data (int, bool, float), and have to read it as a string and manually parse it. If you know the data structure, it may be better to write specifically for that structure. When parsing a title, for example, it would be better to handle most of the tags specifically, but use a generic handler for the gain_effect and allow tags.

nickbabcock commented 10 years ago

[...] there's no way to predict if the bracketed data contains tag = val type data or a plain list of values.

Exactly. This is why I'm not a fan of the format. Even if you see tag = val, tag can still be a list! It's pretty crazy. That is why, even though the parser takes care of the low level parsing, the client still needs to be intimately familiar with the structure of the file. It sucks, I agree, especially because how complex Paradox games tend to be. You brought up scopes before, and yes, there are a lot, but you could scrape the info from the wiki and create a T4 template that would generate all the code for you. It would only have to be done once.

Ask yourself, how do you think Paradox parses their own format? I can't explicitly say due to NDA, but it's just how you would imagine. "if that if this switch over here...". They put in the time to read/write their own format (for better or for worse) and I can't think of a better way than for people to duplicate that logic. Ideally, people would make open sourced libraries that would be able to parse all the information from a given file, and then everyone would just use that library. That's what I'm trying to do with EU4.Savegame (still has a long ways to go). The hope is that people wouldn't have to duplicate logic and just use the library whenever they wanted to do anything with a savegame. It would dramatically lower the barrier to entry for people making tools.

[...] the generic parser can handle the entire title without specifically handling most tags.

Yup, only parsing some tags is definitely a way to get started fast and show something in front of the user. Just don't expect to be able to write info. The writing functionality only works if everything is able to be written out. Previous iterations of this library allowed for select writing of data, but it proved too tedious and easy to make mistakes.

Another issue with doing generic parsing is that you also can't take advantage of the parser's ability to read other types of data (int, bool, float), and have to read it as a string and manually parse it. If you know the data structure, it may be better to write specifically for that structure.

Agreed

When parsing a title, for example, it would be better to handle most of the tags specifically, but use a generic handler for the gain_effect and allow tags.

But you would still have to parse it anyways, though, right? If you parse it into some generic class, you would need to disect the class to get any info out of it. So for instance, let's say generic parsing was added to the parser:


IDictionary<string, object> allow;

public void TokenCallback(ParadoxParser parser, string token)
{
    if (token == "allow")
    {
        allow = parser.ParseGeneric();
    }    
}

Whenever you wanted to do anything with that structure you would have to make sure the key was there, and then you would have to cast the value to whatever you thought it should be. But since you just did that, you know the at least some of the structure, and you'd benefit from writing the logic inline instead of parsing generically.

Measter commented 10 years ago

I feel you may have misunderstood; I'm not suggesting that a full generic handler be added to the project, only the bracket peeking feature. This would allow the user of the library to do generic tag handling if they desire. Only what I've implemented in my fork of the project is what I was asking for in the top post.

In your save game example, and most structures in the game files, you wouldn't need generic data structures to store it, because there are relatively few tags. For the event system, however, I think it would be better to have more generic structures because so many tags share the same basic characteristics. Of course, one big difference would be whether you intend just to store the objects to view in some way, or if you intend to execute it.

In terms of a use case for the feature, I'm replacing the parser in my Title Generator for Crusader Kings II because this one is faster (the smallest gains I've seen are cutting the time in half). I've been able to handle most of the data files, but I ran into a problem with the titles. I must handle the gain_effect and allows scripting while parsing, even though I'm not actually going to be doing anything with it.

Relying on the current indentation would likely result in very messy code, because the the titles may or may not be nested. A d_title_a may or may not be inside a k_title_b and as you see in the example above, the title tags do occur in the scripting. As such, the ability to detect whether the current tag has bracketed data, would allow much cleaner parsing.

nickbabcock commented 10 years ago

I feel you may have misunderstood

Yeah, I think I'm misunderstanding you because I haven't taken a peak at how those files you are dealing with are structured, so I have a hard time fathoming your problem. Initially I thought your peak bracket idea was part of a larger scheme to introduce a generic parser, which is why I went off on that tangent a couple of comments back.

Personally, I don't know about you, but I'm liking this dialog :smile:

In your save game example, and most structures in the game files, you wouldn't need generic data structures to store it, because there are relatively few tags

Quick aside, but I'm hoping to keep the reading/writing code for the save game under 8,000 (most of it auto-generated) lines of code. A single country can have up to 172 top level tags (there are still more that I have yet to set up logic for)!

For the event system, however, I think it would be better to have more generic structures because so many tags share the same basic characteristics.

You kinda lost me here. If so many tags share the same basic characteristics, it sounds like they can share the same parsing logic.

In terms of a use case for the feature, I'm replacing the parser in my Title Generator for Crusader Kings II because this one is faster (the smallest gains I've seen are cutting the time in half).

Hey wow, thanks! I've fine tuned the parser to churn through a 50MB file or 10,000 files in a blink of an eye. You must then understand why I'm not overly thrilled at having a queue introduced in the parser and the speed implications. If GetNextToken is called several million times, which is likely in my use cases, enqueuing and dequeuing several million objects is quite a lot! What I'm trying to get at -- users shouldn't pay for features they don't use. I think your feature definitely should be incorporated, I'm not sure how. Heck, I wouldn't mind if there was a parser for speed and parser for features if it came down to that.

I took a look at your project, and yes, I want to work with you to make sure that everything you want can be done. That's what you should take away from this comment.

Measter commented 10 years ago

Personally, I don't know about you, but I'm liking this dialog :smile:

As am I.

172 tags, eh? That is a couple. For the tag characteristics bit, that's what I meant. Similar tags should share parsing logic and basic data structure.

You must then understand why I'm not overly thrilled at having a queue introduced in the parser and the speed implications.

I had a look at Microsoft's implementation of the Queue class, and under the hood they're just using an array, keeping track of the head and tail, not all that unlike your string buffers. The initial size could always be set if it's really required, though given the array would only be resized a couple times it's not really necessary. The performance hit is pretty much as low as you could get for a queue structure.

nickbabcock commented 10 years ago

Ok I took a look at the source too, looks ok. I suppose neither of us can make a judgement call until tests are actually ran. Saying that, go ahead and create a pull request whenever you are done developing your feature, and we can go from there.

Also I see that your project is targeting .NET 3.5 and this is targeting .NET 4.0, is this a problem? If so, feel free to create another pull request and have the project target 3.5.

Measter commented 10 years ago

I just tested the generic handler project with the 533kB landed_titles file from CK2, and (after discovering that NextIsBracketed() breaks right curly detection in DoWhileBracket()) found that it parses the entire thing correctly in 34ms. A bigger test would of course be with a very large file.

I'd already started replacing parts of my project with this parser and I'd already switched to .NET 4.0, so that won't be a problem.

By the way, just throwing the landed_titles file at that generic parser took less than 15% of the time of the parser I'm currently using. That will probably change when I make a specific parser for the title files. Though it should be noted that the result of the generic parse is fairly similar to the abstract syntax tree the current parser returns.

Measter commented 10 years ago

I had a better idea of how to implement bracket peeking without altering the state of the parser and breaking things. I've created a pull request with the new code.

nickbabcock / Pdoxcl2Sharp

Peek for Brackets #18