toolbox4minecraft / amidst

Advanced Minecraft Interface and Data/Structure Tracking
GNU General Public License v3.0

Search maps for specific biomes near spawn #120

Open mgod opened 8 years ago

mgod commented 8 years ago

I noticed there have been a few requests for searching for seeds by features/biomes. These requests have generally been pointed to scripts in various forums. I'm not sure if there's a policy reason not to have these kinds of features in the core app, but I've written a prototype filter for finding seeds with specific biomes near [0, 0] and was wondering if it made sense to clean it up and open a PR for it.

I'm confident I can build the back end for this in a well-structured way, but I'm unsure of what the UI would look like if I add this, so I'd want feedback on that before spending too much time on it. In its most basic form, it seems like this could be done with users adding their own feature_filter.json, similar to history.txt.

stefandollase commented 8 years ago

Yes, this is a great idea. I also thought about adding something similar to the SeedFinder by CodeRaider to Amidst. However, since this is a pretty big project and there is much other work to do, I did not even create a feature request for it. That does not mean you cannot do it, though.

Now, a good question is whether we should use a JSON file or a GUI to configure the finder. The GUI might be easier to use; however, the code is easily copyable and thus easier to share in the forum. As a programmer, I clearly prefer the code, but this might be too complicated for the average user. We could also add a dual interface that displays both code and a GUI, but that would not be simple at all. Maybe we should start with a simple JSON file and enhance the user interface once the finder is working?

mgod commented 8 years ago

Starting with a JSON interface makes a lot of sense to me. I'd rather limit this to people with some willingness to do a little digging as it's easy to add too many filters. I'm currently running searches where I want at least one mooshroom biome, one roofed forest biome, one mesa biome, one witch hut, one village and one ocean monument that are at most 1024 blocks from the origin and it takes about 5 minutes to find a seed.

The features I've been building for myself are:

Can you think of anything else you'd want to search for in the Amidst toolbox?

stefandollase commented 8 years ago

This already sounds pretty good. I have not thought of any more filters so far, either. However, I don't know the SeedFinder very well, so there might be more ideas over there. I like the group-of-biomes idea.

Regarding the time it takes to find a seed: it might be possible to calculate a percentage of how well a seed matches the given restrictions. This would enable us to display seeds with a pretty good match. For example, if there is a 98% match, the user might decide to take it rather than wait another hour for a better one. However, this is probably something for a second iteration of this feature.

An even more difficult addition would be to display the estimated time to find a seed, based on some pre-calculated averages. However, this is really complex, since it requires us to provide the pre-calculated averages for each combination of Minecraft version, biome and structure. So this would maybe go into yet another iteration of this feature.

For now, we should aim for the simplest possible implementation that is flexible enough to configure many different matches.

How long do you think the implementation will take? This is just to get a rough idea. Also, can you create the pull request early on, so we can talk about the changes from time to time?

mgod commented 8 years ago

Pull request is up for a basic filtering demo. Please let me know if I'm not meeting the style expectations for this project, and send along any other feedback. I'm pretty sure I can build out everything except the UI integration in the next week or so (I'm not familiar with the UI components in this project, so I'm not sure how long that part will take me).

moulins commented 8 years ago

Ah! I've had a very similar idea some time ago, and I've given it some thought.

UI-wise, I think it would be better to have a dedicated "search" function, instead of selecting a filter for the whole program. It would allow us to display a history of recent searches, etc.

Here's a quick UI mock-up I made this weekend for simple searches (a single biome type).

Considering more complex queries, I've been toying with the idea of creating a mini-language for them. For example, if you want a seed with a Jungle biome within 500 blocks of [0;0], and a Taiga or an Ice Plains biome within 500 blocks of [250;250], you could write: `Jungle@[0;0]:500 and (Taiga or IcePlains)@[250;250]:500`

I think such a language would be easier to write and understand than a JSON blob, but it's obviously more work (we have to write a parser).

stefandollase commented 8 years ago

@moulins Thanks for the suggestion. While I like DSLs (I really do!), I would like to keep this as simple as possible for now; we can always add a nice GUI and/or DSL later on. One reason for this is that I currently do not have the time to review complex changes.

jmessenger235 commented 8 years ago

Frankly, having used the JSON interface, I don't think we need a DSL. I think that if we were going to go that far, we might as well add a GUI.

On top of the other features mentioned, I'd like to add the ability to search for quad witch huts, to log different matching levels as a measure of the "quality" of a seed, and to run multiple sets of criteria at the same time. The quad witch hut search is pretty simple; matching to a certain level of quality is a bit more of a challenging idea.
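For illustration, here is a rough sketch of what such a quad hut check could look like, assuming the candidate hut positions are already available from the structure code. The class, the helper name and the 128-block despawn-sphere simplification are mine, not actual Amidst code:

import java.awt.Point;
import java.util.List;

public class QuadHutCheck {
    // Simplified criterion: all four huts of a candidate cluster must be able
    // to share a single AFK spot. If every hut is within 128 blocks of that
    // spot, any two huts are at most 256 blocks apart, so a cheap pairwise
    // check works as a first-pass filter.
    private static final double MAX_PAIR_DISTANCE = 256.0;

    /** Expects the hut positions (block coordinates) of one candidate cluster. */
    public static boolean looksLikeQuadCluster(List<Point> huts) {
        if (huts.size() < 4) {
            return false;
        }
        for (int i = 0; i < huts.size(); i++) {
            for (int j = i + 1; j < huts.size(); j++) {
                if (huts.get(i).distance(huts.get(j)) > MAX_PAIR_DISTANCE) {
                    return false;
                }
            }
        }
        return true;
    }
}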

One of my first thoughts on matching was to find the nearest instance of each requested biome and report on all of them. This, however, is pretty challenging to code and may not be helpful for many seed criteria.

My second thought was bit flags for the optional criteria plus a way to define the mandatory requirements for a seed of interest, but that limits you to one set of criteria and is a bit hard to use for anyone without a programming background.

The best I have been able to come up with (and that also syncs well with #129) is adding a group and a criteria level to each filter, adding that info to the history file for matches, and turning on history any time continuousSearch is true. The values would default to group zero and level zero, meaning everything behaves like a default search until you change it to something more specific. In the example below, all unmarked segments are required and part of the group-zero match; the other lines are commented to describe their behavior.

{
  "continuousSearch": true,
  "structureFilters": [
    //This filter adds a group one match so that any map with a quad witch hut closer than 4096 is returned.
    {"distance": 4096, "structure": "Quad Witch Hut", "minimum": 1, "group": 1},
    {"distance": 2048, "structure": "Village", "minimum": 3},
  ], 
  "biomeFilters": [
    //This filter adds a group two match that looks for Mushroom island near spawn
    {"distance": 512, "biomes": ["Mushroom Island", "Mushroom Island M"], "group": 2},
    {"distance": 2048, "biomes": ["Mushroom Island", "Mushroom Island M"]},
    //These two jungle search filters would require a jungle in the 4096 for group zero, but would flag the match as a higher level if the jungle were within 2048
    {"distance": 4096, "biomes": ["Jungle M", "Jungle Hills M", "Jungle Edge M"]},
    {"distance": 2048, "biomes": ["Jungle M", "Jungle Hills M", "Jungle Edge M"], "lvl": 1},
    {"distance": 2048, "biomes": ["Mesa (Bryce)", "Mesa Plateau F M", "Mesa Plateau M"]},
  ]
}

Thoughts on that matching idea are much appreciated. I was also debating whether to make the levels a set of criteria or an additive factor, so that the match level simply becomes the sum of the levels of all the criteria that were matched.

moulins commented 8 years ago

> Frankly, having used the JSON interface, I don't think we need a DSL. I think that if we were going to go that far, we might as well add a GUI.

In my mind, making a GUI for complex queries is not very useful (and is way more difficult than a DSL), but I agree that a DSL may not be necessary: it was just a cool idea I had.

I like your idea of "points" which contribute toward a total matching level, but I don't quite understand the purpose of the groups: are they alternative ways of matching a seed? (i.e. if all matches of a specific group are true, the seed is returned)


Now, here is what I have in mind for the JSON file:

{ 
  "continuousSearch": true,

  //Unless specified otherwise, all filters in this list must match for the seed to be valid
  "filters" : [

    //This filter searches for specific biomes
    {
      "type": "biome",
      "distance": 512, "biomes": ["Forest", "Forest Hills"]
    },

    //This filter searches for specific structures
    {
      "type": "structure",
      "distance": 2048, "structure": "Village", "minimum": 3,

      //This makes the filter optional
     "optional": true,

      //Number of points to be added to the level score if the filter matches (0 by default).
      //A seed with a bigger score is of better quality.
      "score": 10
    },

    //This filter matches if at least one of its children matches.
    //Only the child with the highest score will contribute to the total score.
    {
      "type": "or",
      "children": [ {...},  {...}, ...]
    }, 

    //This filter matches only if all of its children match.
    //All children contribute to the total score (same behavior as the main filter list)
    {
      "type": "and",
      "children": [ {...},  {...}, ...]
    }
  ]
}

The "score" attribute is like your "lvl", and if I understood correctly, the "or" filter should be able to do the same thing as your groups.

With this format, we have to take into account nested filters and all that jazz (it will definitely not be simple!), but I think the flexibility is worth it.

Maybe we should first make a minimal working implementation that can be extended later (we should make sure that the code is easy to extend).
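To make "easy to extend" a bit more concrete, here is a rough composite-pattern sketch of how the nested and/or filters could be modeled; all of the names below are placeholders, not existing Amidst classes:

import java.util.List;

public class NestedFilterSketch {
    // Stand-in for whatever Amidst would hand a filter (seed, biome data, ...).
    public interface World {}

    // Every filter says whether it matches and how much score it contributes.
    public interface WorldFilter {
        boolean matches(World world);
        int score(World world); // 0 when the filter does not match
    }

    // "and": all children must match; every child contributes its score.
    public static final class AndFilter implements WorldFilter {
        private final List<WorldFilter> children;
        public AndFilter(List<WorldFilter> children) { this.children = children; }

        public boolean matches(World world) {
            return children.stream().allMatch(c -> c.matches(world));
        }
        public int score(World world) {
            return children.stream().mapToInt(c -> c.score(world)).sum();
        }
    }

    // "or": at least one child must match; only the best-scoring child counts.
    public static final class OrFilter implements WorldFilter {
        private final List<WorldFilter> children;
        public OrFilter(List<WorldFilter> children) { this.children = children; }

        public boolean matches(World world) {
            return children.stream().anyMatch(c -> c.matches(world));
        }
        public int score(World world) {
            return children.stream().mapToInt(c -> c.score(world)).max().orElse(0);
        }
    }
}

A new filter type (biome, structure, negation, ...) would then only have to implement WorldFilter; the evaluation code never needs to change.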

jmessenger235 commented 8 years ago

The groups were essentially a way to define multiple sets of search criteria to be run against each seed, and a way of differentiating the matches in the history file. It looks like your "and" and "or" criteria would work to similar ends, but without the traceability in the history file.

How would you implement nested filters in your JSON and what would the use case be?

Check out #129; that is the minimal working implementation that I was basing my comments and improvements on.

Below is just a formatted version of the previous JSON with some additional comments

// unless otherwise specified, all filters will only be part of the default search and will be required.
{
    "continuousSearch": true,
    "structureFilters": [
        //This filter adds a group one match that looks for a quad witch hut closer than 4096 
        //and will be required to return as a group one match as no level is specified
        //matches would be logged in the history as a group one seed for sorting or filtering
        {"distance": 4096, "structure": "Quad Witch Hut", "minimum": 1, "group": 1},
        {"distance": 2048, "structure": "Village", "minimum": 3},
    ],
    "biomeFilters": [
        //This group two filter looks for Mushroom island near spawn
        {"distance": 512, "biomes": ["Mushroom Island", "Mushroom Island M"], "group": 2},
        {"distance": 2048, "biomes": ["Mushroom Island", "Mushroom Island M"]},
        //These two jungle search filters would require a jungle in the 4096,
        //but would flag the match as a higher level if the jungle were within 2048
        {"distance": 4096, "biomes": ["Jungle M", "Jungle Hills M", "Jungle Edge M"]},
        {"distance": 2048, "biomes": ["Jungle M", "Jungle Hills M", "Jungle Edge M"], "lvl": 1},
        {"distance": 2048, "biomes": ["Mesa (Bryce)", "Mesa Plateau F M", "Mesa Plateau M"]},
    ]
}

jmessenger235 commented 8 years ago

Another feature worth considering is a minimum area, mainly so you could search for particularly large biome areas, to help people who want to look for certain biome sprawls.
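A hedged sketch of how a minimum-area check could sit on top of the biome scan, assuming the quarter-resolution biome array discussed in this thread (one value per 4x4 block cell); the array layout and names are illustrative only:

public class BiomeAreaCheck {
    // Each cell of quarter-resolution data covers a 4x4 block column, i.e. 16 blocks.
    private static final int BLOCKS_PER_CELL = 16;

    // True if the given biome covers at least minAreaInBlocks within the scanned
    // region. This counts total coverage, not contiguous area; finding the largest
    // connected patch would need a flood fill instead.
    public static boolean hasMinimumArea(short[] quarterResBiomeData,
                                         short biomeId,
                                         int minAreaInBlocks) {
        int cells = 0;
        for (short value : quarterResBiomeData) {
            if (value == biomeId) {
                cells++;
            }
        }
        return cells * BLOCKS_PER_CELL >= minAreaInBlocks;
    }
}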

moulins commented 8 years ago

> How would you implement nested filters in your JSON and what would the use case be?

By "nested filters", I mean that orand and filters can contain others or and and filters. For example, if you want a seed with one set of biomes or another set of biomes, you would write :

{
  "continuousSearch": true,
  "filters" : [{
    "type" : "or",
    "children" : [
      {
        "type": "and",
        "children" : [
          {"type": "biome", "distance": 1024, "biomes": "Forest"},
          {"type": "biome", "distance": 1024, "biomes": "Plains"}
        ]
      },
      {
        "type": "and",
        "children" : [
          {"type": "biome", "distance": 1024, "biomes": "Mesa"},
          {"type": "biome", "distance": 1024, "biomes": "Desert"}
        ]
      }
    ]
  }]
}

stefandollase commented 8 years ago

I just merged in #129 and am now working on a better integration of it into Amidst. I still have to catch up with your comments above.

stefandollase commented 8 years ago

@mgod Overall, I think you did a good job. I made several adjustments and started to implement a proper workflow for it, especially to get the threading right. E.g. you were setting a world for the main window from a worker thread, which should only be done from the event dispatch thread; the worker executor has some methods to handle such cases. I also started to implement a simple GUI, which is really just a first draft to get the functionality right.

I moved the searchContinuously flag from the JSON to the GUI, because it is not part of the world filter. However, I was thinking about moving the worldType into the JSON, because this does have an effect on the search result. I am not sure about this yet. What do you think?

Sadly, I am running out of time for today. I hope to get this done by the end of the week.

jmessenger235 commented 8 years ago

@stefandollase FYI I have made a number of updates on my fork that implement a bunch of the features/fixes discussed here or on #129 (essentially everything but a GUI). I will likely create a PR tomorrow morning once I finish testing the witch hut cluster detection, and then follow with a PR against the wiki that can be merged when released. Any pre-PR review and comments are appreciated. The commit is here, but I have added another feature since, so that is not the latest. I know you are working on other code at the moment that may cause conflicts. If there are any conflicts, don't worry about them; just change whatever you need to and I'll resolve any conflicts this code creates.

jmessenger235 commented 8 years ago

@stefandollase I finished a documentation page for most of the features that I mentioned above and have submitted the PR for the code that links to a temporary copy of the documentation associated with it. If any of the changes you make conflict, I'll resolve those Sunday afternoon EDT.

stefandollase commented 8 years ago

Sadly, I did not get any more time in the last week to work on this project. This week I also cannot spend any time on it, but I might be able to finish my changes next week.

@jmessenger235 Thanks for your effort. However, I would really have liked you to wait for my changes, because I am pretty sure they will cause conflicts with your changes, but I guess we will have to wait and see how it turns out.

jmessenger235 commented 8 years ago

@stefandollase I wanted to wait as well, but I was also in a quandary, as I am trying to find a good seed for a pending server reset. I went the route of putting it out there for anyone who wants to test and review it, and I'll deal with the conflicts when they come up. Hopefully they will not be too bad, but that was a risk I took.

I'm also planning on adding the ability to search a seed list and will likely add this tomorrow afternoon. This is because the QWH search is simply too slow. The bottleneck seems to be structure generation, which I would assume is tied to the low performance of reflection, but I don't have the expertise to know for sure and it could be due to other factors (including a sub-optimal search algorithm). Searching one of the comprehensive lists of 125k+ seeds out there is simply a better use of CPU cycles.

stefandollase commented 8 years ago

@jmessenger235 Thanks for the explanation. Now I can understand why you did the implementation that quickly.

I like the idea of adding another method to supply the search algorithm with seeds (the seed list). Again, I would like you to wait with the implementation, but again, if you need it before I can finish my changes then I guess the only way to go is to implement it now and resolve the conflicts later on.

jmessenger235 commented 8 years ago

@stefandollase how are the changes coming along?

stefandollase commented 8 years ago

@jmessenger235 I really currently don't have the time ... sorry. I will reply to this issue as soon as this changes.

jmessenger235 commented 8 years ago

Alright. Thanks for the quick update!

ForestFeather commented 8 years ago

Question: Is this focused on a circular radius or a square radius? (To see if for example it'd fit inside a world border of X distance centered on spawn OR 0,0.)

jmessenger235 commented 8 years ago

All implementations at the moment use a square radius around 0,0, for simplicity's sake.
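For clarity, "square radius" here just means a bounds check on each coordinate rather than a Euclidean distance check; a minimal illustration (not the actual filter code):

public class RadiusCheck {
    // Square "radius": the point lies inside a (2r)x(2r) box centered on 0,0.
    static boolean inSquareRadius(int x, int z, int radius) {
        return Math.abs(x) <= radius && Math.abs(z) <= radius;
    }

    // Circular radius: ordinary Euclidean distance from 0,0.
    static boolean inCircularRadius(int x, int z, int radius) {
        return (long) x * x + (long) z * z <= (long) radius * radius;
    }
}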

jmessenger235 commented 8 years ago

I have put together what I understand the general needs for this feature to be, based on what is here and the needs of the seed search project that I am currently a part of. The link is https://docs.google.com/document/d/1xaQ5N9I5Jm1hEM9Tcg8LhywiaBb5J8VHuZUQzEr1zYk/edit. Commenting is enabled rather than editing, only to prevent deletion of the content by people not associated with this project. If we need or want to, this could be moved to a markdown file on the wiki as well. I just wanted to have this somewhere so that stuff could be added or clarified as needed, since GitHub discussion is not really suited to design documents. Can anyone think of anything else that needs to be added?

@moulins We seem to agree completely on the need to clean up the memory usage of the existing implementation. I already made huge strides on that in the current version in #169, and the final segmentation touch is in a small bit of code that supports non-central searches and non-512 segments for searching seed lists. That said, for most searches, segmenting like that will probably not improve performance much because of the ratio of matching criteria to unmatched criteria. It could be a small boost though.

By there being redundant data in your JSON (from my comments on #186), I mean that portions of the data in your JSON can be inferred, or are far more verbose than in @mgod's implementation. Type is an example: in @mgod's implementation it is inferred from the filter's position, resulting in much less JSON content. The other item is optional; optional should be inferred from score. Additionally, nesting criteria adds the requirement for children tags and indentation for more complex searches, further increasing size over a flat implementation like the one in the master branch. Additionally, there is still no way to distinguish one "or"-type match from another in any kind of logging.

Setting the details of both aside for a moment, is there a practical use case where nesting would allow you to do something that could not be done with the groups structure I mentioned? It seems like a different facet of the tags-vs-directories debate. There are a few cases that would be shorter to write with nested criteria, but that should be weighed against making the average search longer to set up (i.e. more JSON), and against whether those criteria would be better handled with scoring. The case for nesting only seems to make sense at three levels of nesting or more; a flat organization is cleaner in every other case. If multiple groups were supported, the flat version would always be cleaner and shorter. Beyond length, a nested version makes testing much more work than a flat version, because those deeply nested cases need to be tested; i.e. a good-to-complete set of test cases is much easier to write for a flat structure.

Apart from implementation-specific optimizations, is there any other specific advantage to the architecture of the code and JSON that you are proposing in #186?

moulins commented 8 years ago

I agree that the JSON structure I proposed could use some improvement, but the focus of #186 was on the implementation strategy more than the JSON. I'll try to create a more concise nested format when I have the time. (I almost think an XML format would be more elegant here.)

> Additionally, there is still no way to distinguish one "or"-type match from another in any kind of logging.

To distinguish between different alternatives, we could attach a tag to certain criteria, and log the tags of the criteria which contribute to the match.

On the subject of flat vs. nested, I believe a query of the form (A or B) and (C or D) cannot be expressed cleanly with groups and scores, but at the same time, a query like "two of A, B, C" is awkward to express with ands and ors. If we choose the nested format, we should definitely add some sort of score system with it.

I find it strange that you think non-centered and non-512-sized searches can be added "at the end", because it seems to me that your implementation relies on this to be efficient. (You only use a single array for the biome data; imagine two searches with radius 100, one at (-5000,-5000), the other at (5000,5000): will you compute all the data in a 10000x10000 square?)


For the performance improvements enabled by the architecture of #186, I believe there are three:

jmessenger235 commented 8 years ago

If a nested format is to be used, it would be better to make it terser, and it might be better as XML to help eliminate some tags, but a change to XML seems like it would be six of one, half a dozen of the other.

The naming tags would be essential, as we need to classify stuff for logging or the more advanced searches will become of little use.


I think that flat vs. nested is where we are coming up with the differences, as the code is heavily linked to the JSON structure. In a flat structure, I agree that the case (A or B) and (C or D) is not easy to express perfectly. I see that case as the three-deep nesting where I said nesting excels (a base criterion of none in this case, an and statement, and an or statement). However, the score can be used to digit-flag each of the criteria so they can be picked out of the log file with ease, and it gives you more information about the seeds you are searching for. Alternatively, for biomes, you can specify multiple sets of biomes in the biome array. That allows the biome version of that case to be handled quite nicely, and the structure tag could be altered similarly to allow the same for structures in the current implementation. In combination, that allows for doing most versions of even something like ((A or B) and (C or D)) or ((E or F) and (G or H)). However, I do agree that the flat implementation is limited in this case. It is not as clean or clear and should be documented in either implementation, but this basic question brings me to the counter-question that drove most of what I wrote.

That is the question of usability and optimal use. What is someone searching for (A or B) and (C or D) actually looking for, and are they thinking too small? I am hard pressed to think of a set of criteria that would fit that format, and produce a seed superior for its intended use, that would not better be expressed as an and criterion. I.e., what is the user looking for in (A or B) and (C or D) that is not even better satisfied by A and B and C and D? These observations do not take into account an exclusive or, as neither implementation proposes that, and I think it would be a feature of highly questionable value with a performance hit. Having run dozens of searches across trillions of seeds using a variety of apps, I have actually had to learn to add more and more requirements: at the speed seeds are searched, simple criteria match so quickly and produce so many results that I have had to mark optional criteria as required simply to keep the results manageable. Based on that reasoning of (A and B and C and D) > (A or B) and (C or D), one is left with and criteria, which is exactly what a flat structure excels at expressing.

Am I missing some cases here? Thoughts on that logic? Thoughts on the digit or bit-flagging?


> I find it strange that you think non-centered and non-512-sized searches can be added "at the end", because it seems to me that your implementation relies on this to be efficient. (You only use a single array for the biome data; imagine two searches with radius 100, one at (-5000, -5000), the other at (5000,5000): will you compute all the data in a 10000x10000 square?)

I would not consider this anywhere near "the end." When starting the work on @mgod's implementation, I spent a good deal of time considering optimizations and architecture. Non-512, segmented, and non-centered searches were some of the first things that I considered; they were simply not the first that I implemented, as I had more pressing needs to actually find a server seed. That said, if I didn't think they were easily doable, I would have proposed a re-implementation myself, and I would have done so a long time ago, as those are features I will need for another MC project I am working on. Even if this were the end, I'm a developer that does maintenance on legacy code as a large part of my work; the concept of "at the end" is the reality of most of the code I write.

In this case the implementation is actually quite simple. The locking to 512 was for structures, because of nuances of how Minecraft generates structures; @mgod or some of L64's work on the SciCraft Seed Finder would likely add clarity to the details of the reasons behind the MC behavior. There are no performance reasons for it. For non-512 you can simply remove the check from the base class, then adjust the structure filters so that you expand the range to the nearest 512 and exclude the structures outside of the specified range. Those are really simple changes. Biome filters should work as-is once the enforced range is removed.

> (You only use a single array for the biome data; imagine two searches with radius 100, one at (-5000, -5000), the other at (5000,5000): will you compute all the data in a 10000x10000 square?)

In #169 I optimized the implementation to use a single array of biome data, generate it once, and pass it to each filter. Keep in mind that it is generated as one biome data point per 16 blocks (a 4x4 area); the implementation does not currently scale it up like MC does, we just search it at that level to save compute without compromising accuracy in any reasonable way. Also keep in mind that what you are asking is a question about the implementation of an un-implemented feature. The answer is that it could and should be implemented in such a way that it would not have to generate all that data. Essentially, the search needs to be segmented before release for memory reasons. As part of that, the decision needs to be made whether we will support a separate center point for each criterion before this becomes a question. A separate center for all criteria is one thing and would be quite easy (a basic offset field). A separate center for each criterion individually, however, would require more code and would probably pull in your Region classes to make the coding easier. The world filter can then make some basic decisions about what to generate and pass to which filter based on the regions.
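As a rough illustration of the "generate once, pass to every filter" approach described above (the interfaces and names are placeholders, not the actual #169 code):

import java.util.List;

public class SharedBiomeDataSearch {
    interface BiomeDataSource {
        // One value per 4x4 block cell (quarter resolution), for the square
        // of the given radius centered on 0,0.
        short[] getQuarterResolutionData(long seed, int radiusInBlocks);
    }

    interface BiomeCriterion {
        boolean isSatisfiedBy(short[] quarterResBiomeData);
    }

    static boolean seedMatches(long seed, int radiusInBlocks,
                               BiomeDataSource source,
                               List<BiomeCriterion> criteria) {
        // The expensive part runs exactly once per seed...
        short[] data = source.getQuarterResolutionData(seed, radiusInBlocks);
        // ...and every filter reuses the same array.
        for (BiomeCriterion criterion : criteria) {
            if (!criterion.isSatisfiedBy(data)) {
                return false;
            }
        }
        return true;
    }
}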


For each of your bullets:

moulins commented 8 years ago

On the flat vs. nested debate, I think a nested structure is inherently more flexible, but I agree that work needs to be done to get it into a more usable form.

I believe you and I don't think at the same scale: your use case is finding a seed as optimal as possible for a server, and you are ready to let the search run for several hours, whereas mine is more 'casual' in scope: I just want to find a seed with some rare biome near spawn for my SSP world. As such, I don't have much experience with very complex searches, so I can't say whether your argument is valid.

I nevertheless think it would be a shame to lose flexibility if it comes without much usability cost (and I may have an idea to make it work; we'll see...). This also applies to the question of separate centers for criteria: here it is completely optional and doesn't cost anything! (other than the ~300-line Region class, which is also useful elsewhere)


On the subject of performance, and in response to each of your bullets:

{
  "continuousSearch": false,
  "filters": [
    {
      "type":"biome",
      "radius": 1024,
      "square": true,
      "biomes": ["Mesa"]
    },
    {
      "type":"biome",
      "radius": 1024,
      "square": true,
      "biomes": ["Ocean", "Deep Ocean"],
      "negate": true
    }
  ]
}

In words, we want a seed with no oceans in a 2000x2000 square centered around spawn, and we also want a Mesa biome. There certainly are other cases for negative criteria, but an ocean-less seed seems quite useful to me. Note that this can't be done by simply 'inverting' a positive filter, as we have both a positive and a negative criterion. From my measurements, 11.1% of seeds satisfy the Mesa condition, 2.82% satisfy the no-ocean condition, and 0.31% satisfy both at the same time.

When a seed matches, it takes 47ms to test it; as expected, the time is constant, since we have to search the whole area to ensure there is no ocean. When a seed doesn't match, it takes between 5ms and 30ms to test, with an average time of 7.7ms. Again, this is what we expect. Note that if the seed has a Mesa AND an ocean, we won't necessarily have to wait the full 47ms: it will exit early. In the end, we have an average time of 7.8ms (47*0.31% + 7.7*99.69%).

In your implementation, the biome data is always fully generated, so each seed takes the same time (~40ms), whereas I achieve a 5-fold increase in speed thanks to segmentation! As you said,

> you have to generate every piece of biome to determine that the last filter does not match,

but if this filter matches most of the time (like the ocean one), this means that most of the searches will stop early, and that the full price of the search will only be paid by a minority.

jmessenger235 commented 8 years ago

> As such, I don't have much experience with very complex searches, so I can't say whether your argument is valid. I nevertheless think it would be a shame to lose flexibility if it comes without much usability cost

Your core tenet seems to be that nested is more flexible, yet from the quote above, you concede that you don't know if the flexibility is even useful, and that it does have some usability cost. I'll address the "more flexible" argument in more detail, but I honestly hoped you had a better reason than that. Flexibility for the sake of flexibility results in code that can become a maintenance hassle, get scrapped, or both. I have watched weeks of dev effort scrapped because flexibility for the sake of flexibility turned into features that were unusable or a constant source of complaints, as they made the basic use cases annoying.

At the high level, I will make two basic counter-arguments. The principal argument is that flat is just as flexible as nested; it simply provides the flexibility in different ways. This was not especially clear in my last comment, but I have had more time to think about how the flexibility is exposed. I'll also make the further argument that the way the flat structure exposes that flexibility is shorter and allows nearly all criteria to be expressed more efficiently and beneficially.

Flexible Flat

In my previous comment, I described multiple ways to deal with the use case (A or B) and (C or D). The basic reason is that an or as a child of an and is the only place where grouping does not handily beat the nested version in usability. The other cases, where an and is a child of an or, are much easier to express with flat criteria, as criteria do not have to be copied and pasted; they can simply be listed as part of multiple groups. The simple way to express the problem use case is with the flagging that I mentioned. Essentially, each criterion is given a score, for example: A=1000, B=0100, C=0010, and D=0001. Thus a seed that matches ABCD would be logged as scoring 1111, ABC as 1110, ACD as 1011, and so forth. To make this better, an int param could be added to the search to require the groups with all-optional criteria to match at least so many of their filters; that param could also be implemented on a per-group basis. AB and CD would both match in that capacity, but those are easily filtered out of the search log. In the current implementation, the flagging would be limited to numbers, but it could easily be changed to strings or characters. The core remains the same though: the case is fulfilled. Flat is flexible.
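A small sketch of the digit-flag idea, assuming each criterion carries a flag value like 1000/0100/0010/0001 and the logged score is just the sum of the flags of the criteria that matched (illustrative only):

import java.util.LinkedHashMap;
import java.util.Map;

public class DigitFlagScore {
    public static void main(String[] args) {
        // Hypothetical per-criterion flags: A=1000, B=0100, C=0010, D=0001.
        Map<String, Integer> flags = new LinkedHashMap<>();
        flags.put("A", 1000);
        flags.put("B", 100);
        flags.put("C", 10);
        flags.put("D", 1);

        // Suppose a seed matched A, C and D but not B.
        Map<String, Boolean> matched = Map.of("A", true, "B", false, "C", true, "D", true);

        int score = 0;
        for (Map.Entry<String, Integer> entry : flags.entrySet()) {
            if (matched.getOrDefault(entry.getKey(), false)) {
                score += entry.getValue();
            }
        }
        // Prints 1011, which reads directly as "A, C and D matched".
        System.out.printf("%04d%n", score);
    }
}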

Exposure's Edge

Flat works, but I said I'd argue why it is better than nested. So, let's go back to the (A or B) and (C or D) use case and the scoring A=1000, B=0100, C=0010, and D=0001. Thus ABC logs as 1110, ACD logs as 1011, and so forth. Why is that better? There is more information. Whoever is doing the search is going to have to decide at the end which seed they are going to use, and in order to make that decision they are going to need information. Maybe they realize they really want a 1111 match, as it fits better and is more common than they realized; maybe it is a 1010 match they prefer, or perhaps just AB. In any case, they have the information to make that assessment. Can you provide that information in a nested search? Yes, I wouldn't argue that point. Nevertheless, the flat structure will be shorter to write because it will have less duplicate JSON data. In essence, it provides this information by default, as a result of how the flat implementation exposes the flexibility, and that is the edge the flat structure has.


That use case for negation, and its performance, is interesting. It turns the spawn exclusion I mentioned as a viable use case into a practical example, and one that has performance benefits like I acknowledged, only more substantial than I would have expected. On the other hand, the exclusion area is so large that it excludes seeds that could be very valuable, since boat travel is about 30% faster than sprinting. However, the example does not change the fact that the improvement is functionally non-existent in cases without negated criteria. As we agree that segmentation should be implemented for memory reasons, I updated the doc on the point of segmentation performance, but I don't have any other observations on performance that are worth mentioning at this point of the implementation.

alchimystic commented 8 years ago

Hello. I started working today on a similar scanner for biome proximity. My seed scanner uses MCP (9.24 beta 1 currently), not Amidst. I didn't look in detail at the functionality (and I didn't look that much at the Amidst code yet), but there is at least one difference: I am not scanning for features (witch huts, temples, villages, etc.), just biomes. This way I can get results a lot quicker. Actually, I was able to get the biome proximity (in chunk distance) of a 2000 block radius around the spawn point in less than 20 seconds. I will now run my scanner on seeds, starting from 1, and keep the results. Maybe I can later transform my results and publish them in a format convenient for you.

jmessenger235 commented 8 years ago

@alchimystic what kind of seed per second rates are you getting? How are you saving the data? Are you saving all of the biome data or just a set of flags for which biomes are present?

alchimystic commented 8 years ago

@jmessenger235 I am not measuring the time per seed, but I can get all the biomes in a 4000x4000 area in about 15-20 seconds. For each seed, I am just saving a text file with one line containing the toString of a list (List<Pair<String, Integer>>), where the String is the biome name and the Integer is the chunk distance to the spawn. With this I can easily grep for specific items. It should not be too difficult to output some JSON instead.

I started scanning 40 minutes ago (with 3 independent workers) and I have 330 results (so around 8 per minute).

moulins commented 8 years ago

> I am not measuring the time per seed, but I can get all the biomes in a 4000x4000 area in about 15-20 seconds.

That seems slower than what we do in our implementations (biome data in a 2000x2000 square takes 50ms to compute on my machine, so a 4000x4000 square should take ~0.2s)... Do you create the full-resolution data or, like AMIDST, only the quarter-resolution data?

alchimystic commented 8 years ago

I use the MC code itself, more specifically the arrays from BiomeProvider.getBiomesForGeneration. Probably running this code has a lot of overhead somewhere. I really have to look at the Amidst code; I'm curious how it can be so close to the real biome layout of the game.

Has anybody thought about porting the Amidst code to Scala? I'm sure we would shrink the codebase by half (at least), and it would be a hell of a project :D

stefandollase commented 8 years ago

@alchimystic Thanks for the suggestion about using Scala. However, unless somebody can come up with some major advantages for Amidst, we will not use it. I mainly see the following disadvantages:

alchimystic commented 8 years ago

Well, you are right on most of those. The biggest advantage would be to reduce the codebase immensely, and also (why not) to learn a new language.

For example, the class SymbolicConstructor could be rewritten in Scala like this:

import java.lang.reflect.Constructor

case class SymbolicConstructor(parent: SymbolicConstructor, symbolicName: String, constructor: Constructor[_]) {

  def call(parameters: AnyRef*) = new SymbolicObject(parent, constructor.newInstance(parameters:_*))
}

A lot more compact, less boilerplate. Of course some of the Scala advantages are also present when using Java 8, but not all :)

stefandollase commented 8 years ago

Don't get me wrong: I like to learn new languages, however I think it does not fit this project very well.

Reducing the code base is not a value in itself. When it helps the reader to understand the code, it is better to have a more explicit code base. I am sure that it becomes easier to read the Scala code after one gets used to it. However, this is exactly the problem: this project should not require you to get used to Scala. Thus, I guess it is not gonna happen.

jmessenger235 commented 8 years ago

What I would actually be most interested in is porting the biome and structure code to Aparapi or some other GPU-capable language/format, as that would have some elephant-in-the-room-sized advantages.

stefandollase commented 8 years ago

@jmessenger235 Did you already work with Aparapi? It is new to me, so I don't know whether it will improve the performance a lot without causing other issues. In general, as I said before, we should first get the basic feature done and afterwards we can worry about the performance. Before we decide to apply a specific technology to improve the performance, we should definitely do some measurements to identify bottlenecks. Afterwards, we can choose a specific technology to speed things up.

Personally, I believe that the bottleneck is not the processing power used by our filter, but the Minecraft world generator. This assumption is based on the fact that the Minecraft world generator is also the bottleneck when it comes to displaying a world with Amidst; also, the filter code is very simple. Thus, it also should not help to run the seed search multithreaded, because the world generator cannot run multithreaded: it is synchronized. However, we might be able to load the Minecraft jar file multiple times (multiple MojangAPIs), once for each of the threads. The seeds are then distributed over the threads, so that each seed is checked by exactly one thread. This might speed up the search, however it will probably require quite a lot of memory, because we basically load the game multiple times. We will have to experiment with this, but again: we should not worry about this right now, and all of this is just an assumption. I might be completely wrong, which is why we should do some performance tests before we start the optimizations.
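A sketch of the "one seed per worker, each worker with its own loaded jar" idea; the SeedChecker interface and factory are placeholders for whatever would wrap a per-thread MojangAPI/world generator instance:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

public class ParallelSeedSearch {
    // Placeholder: one instance per thread, backed by its own loaded Minecraft
    // jar so the synchronized world generator is never shared between threads.
    interface SeedChecker {
        boolean matches(long seed);
    }

    static void search(Supplier<SeedChecker> checkerFactory, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                SeedChecker checker = checkerFactory.get(); // this thread's own generator
                while (!Thread.currentThread().isInterrupted()) {
                    long seed = ThreadLocalRandom.current().nextLong();
                    if (checker.matches(seed)) {
                        System.out.println("match: " + seed);
                    }
                }
            });
        }
        // pool.shutdownNow() stops the search.
    }
}

The memory cost is exactly the concern mentioned above: each worker carries its own copy of the loaded game.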

Finally, let me come back to Aparapi. I am a little bit worried that it might reduce the platform independence of Amidst. Also, as stated above I am not yet convinced that it is worth the effort in terms of the performance gain, but we will have to look at this more closely in the future.

CubeTheThird commented 8 years ago

If the .jar were to be loaded multiple times, could this be equally used to boost the performance for a single seed, where each thread is testing different regions? Or do the same bottlenecks apply for this workload?

jmessenger235 commented 8 years ago

@CubeTheThird Having each thread search a separate seed will be easier to implement. Additionally, using multiple threads per seed could backfire in a couple of places, resulting in reduced overall performance.

@stefandollase You are correct. The GPU could massively accelerate all of the code, but only if the MC code could be used or converted to run on the GPU. Having looked into it, this is not something we will likely want to do, as the effort would be too large.

CubeTheThird commented 8 years ago

Yes I agree it would be easier to implement, if each thread were testing different seeds. I was just wondering if there was a potential performance boost if say each thread were to pre-load different areas of the same seed, where the testing could then be performed on this pre-loaded data.

As for GPU usage, it's very difficult to say whether there would be a reasonable boost from utilizing it. The main place it would be an advantage would be loading the world data, but only if that were an easily parallelizable routine. The searching aspect would hardly be able to take advantage of a GPU's capabilities, since this operation is practically linear and will vary greatly depending on the search criteria.

jmessenger235 commented 8 years ago

The gotcha is when a seed is determined to match (positively or negatively) before all of the pre-loaded data is needed, so some of the generated data is never used.

The GPU being limited is because you are not thinking data-parallel. Each GPU thread could be assigned a separate seed if all of the generation code were ported to the GPU. Trust me when I say that a GPU with enough memory to handle all of the code and arrays would make this run mind-blowingly fast.

CubeTheThird commented 8 years ago

This is true; unused generated data may be a waste of CPU time.

Data parallelism is not the issue. The type of data utilized here certainly has the potential for performance gains through parallelism. What I'm bringing into question is the set of operations being performed on said data. GPU processing is best suited for stream processing, or SIMD-type operations. While I have not thoroughly studied the world generation code, I speculate that (and correct me if I'm wrong) it has at least a few conditional branches, for which GPUs are not well suited. Furthermore, by having each GPU thread run independently, both to handle the world generation and the searching, the threads will most definitely get out of sync, as some will finish their searches more quickly than others before moving on. I would anticipate this workload could initially be fairly quick, but performance would degrade once threads are no longer performing the same operations.

A better solution would be to have the GPU perform the world generation (most likely for multiple seeds at the same time), before passing the data along to the CPU to do the searching. Yes, this still has the problem of wasted data, since it's very likely a seed can be deemed valid or invalid before the data is all used, but this would likely result in more consistent performance and better utilization of GPU resources.

stefandollase commented 8 years ago

I think it is possible to generate the biome information for one seed in parallel using multiple jar file instances. However, without looking at the code, I agree with @jmessenger235 that this might be hard to add to the current code base. Also, it is only a benefit if you want to verify that one given seed matches your query. As soon as you want to scan through multiple seeds you will not notice any performance improvement, because it does not matter if you

  1. process one seed per second using four threads or if you
  2. use four threads with each of them processing one seed every four seconds

In both cases you have processed four seeds after four seconds using four threads. However, this does not yet consider synchronization overhead, which makes the first case less efficient than the second. I think it is really uncommon to check whether one given seed matches a given query, and that is probably always an operation that is fast enough. When we think about performance, we should always think about the case where the user searches for matching seeds among a large number of seeds. As illustrated by the example above, it is better to use one thread per seed. Luckily, this is also easier to implement.

Regarding the GPU: I assume it will only help us if we can run the complete biome generator and seed search code on the GPU, without transferring data to the CPU. This is because transferring data is slow. However, again, this is just an assumption. It might turn out to be irrelevant or just wrong. However, if it is true we might run into problems, because this is quite a lot of code that needs to run on the GPU. The GPU memory might be too small for many users and thus this optimization would not be helpful. I can't say it often enough: This is all based on assumptions, which is also the reason why I don't want to think about these optimizations too much before we have a solid single threaded implementation. Once we have that, we can try several optimizations and measure the performance gain for different hardware and use cases. If an optimization actually improves the performance in the most important use cases, it still depends on its code complexity whether it will be merged.

jmessenger235 commented 8 years ago

@CubeTheThird The CPU would not be able to search the quantity of biome data a GPU could generate, purely on the basis of the flops ratio between the two. However, generating and searching only small chunks of the biome data for a given seed at a time would make better use of the GPU.

@stefandollase Your assumptions about GPU memory transfer are generally correct. I think the GPU would have the space to run the code though, based on my experience running the Blender rendering kernel. We do need to focus on the implementation at hand; something like this is likely out of the scope of Amidst. I simply mentioned it as it is on my mind for another project, where porting the biome and structure gen code to the GPU for performance reasons is being seriously considered.

I understand time has not permitted, but we are waiting for your code for thread syncing and the other features you mentioned, so this discussion is not really distracting me from implementation: any work I do now will just create more work to merge into your changes, and I have found what I needed for seeds at the moment. Additionally, we are going to need some executive input on flat vs. nested at some point in the near future. I don't think we are going anywhere in this discussion; we simply prioritize different things, leading us to rather mutually exclusive conclusions.

@moulins I have an idea for you. What is your take on the following structure:

{
  "continuousSearch": true,
  "biomeFilters": [
    {"distance": 2048, "biomes": ["Mushroom Island"], "group": ["Rare"]},
    {"distance": 2048, "biomes": ["Jungle M"], "group": ["Rare"]},
    {"distance": 2048, "biomes": ["Mesa (Bryce)"]},
    {"distance": 2048, "biomes": ["Plains"]},
    {"distance": 2048, "biomes": ["Ice Plains Spikes"], "group": "Cold"},
    {"distance": 3072, "biomes": ["Mega Spruce Taiga"], "group": "Cold"}
  ],
  "structureFilters": [
    {"distance": 2048, "structure": "Village", "minimum": 1 }
  ],
  "logicFilters": [
    {"or": ["Cold", "Rare"], "group": "foo"},
    {"and": ["Default", "foo"], "group": "bar"},
    {"report": ["Rare", "bar"]}
  ]
}

The filter itself doesn't find anything of use, but is used to demonstrate the structure. It adds three additional logic filters. Each of those filters has an array of groups; the members of those groups would be treated like the children in the nested version. The report array allows looking only at the results of certain groups. Essentially, this prints matches for (Cold OR Rare) AND Default, and prints any seed that matches Rare. I don't have time to explain how it would be implemented right now, but it can be done without too much work. @moulins, your thoughts?
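A minimal sketch of how those logic filters could be evaluated once the base groups have been matched; the group names mirror the JSON above, while the evaluation code itself is just my illustration:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LogicFilterSketch {
    public static void main(String[] args) {
        // Results of the base biome/structure filters for one seed, keyed by
        // group name ("Default" stands for the unnamed group-zero filters).
        Map<String, Boolean> groups = new HashMap<>();
        groups.put("Default", true);
        groups.put("Rare", false);
        groups.put("Cold", true);

        // {"or": ["Cold", "Rare"], "group": "foo"}
        groups.put("foo", anyOf(groups, List.of("Cold", "Rare")));
        // {"and": ["Default", "foo"], "group": "bar"}
        groups.put("bar", allOf(groups, List.of("Default", "foo")));

        // {"report": ["Rare", "bar"]} -> only these groups get logged.
        for (String reported : List.of("Rare", "bar")) {
            if (groups.getOrDefault(reported, false)) {
                System.out.println("seed matches group " + reported);
            }
        }
    }

    static boolean anyOf(Map<String, Boolean> groups, List<String> names) {
        return names.stream().anyMatch(n -> groups.getOrDefault(n, false));
    }

    static boolean allOf(Map<String, Boolean> groups, List<String> names) {
        return names.stream().allMatch(n -> groups.getOrDefault(n, false));
    }
}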

On a secondary note, if we end up going with a nested structure, it needs to be shortened, similar to below.

{
    "continuousSearch": true,
    "filters": [
        {
            "or": [
                {
                    "and": [
                        {
                            "distance": 1024,
                            "biomes": [
                                "Forest"
                            ]
                        },
                        {
                            "distance": 1024,
                            "biomes": [
                                "Plains"
                            ]
                        }
                    ]
                },
                {
                    "and": [
                        {
                            "distance": 1024,
                            "biomes": [
                                "Mesa"
                            ]
                        },
                        {
                            "distance": 1024,
                            "biomes": [
                                "Desert"
                            ]
                        }
                    ]
                }
            ]
        }
    ]
}

Additionally, design thought needs to be put into all of the optional criteria/features for searching and how they will affect the implementation (such as biome area). I also added another requirement to the doc that I forgot, as it wasn't discussed directly, but it will help a lot of searches: a way to specify a max distance from center for a feature, with a score that is scaled up the closer the feature is (scaled linearly or in steps does not matter too much to me). For or-type stuff it is important, but when dealing with and criteria it will be essential for expressing some matches efficiently.

stefandollase commented 8 years ago

I just merged my announced changes, see #197, #198 and #199. #197 and the source code contain information about what still needs to be done.

I still did not read through your discussion about the different query syntaxes. Thus, I don't yet have my own opinion about that.

jmessenger235 commented 8 years ago

I don't see any todos in the code related to this, as mentioned in those PRs. I should have my PR updated some time Tuesday to fix a lot of bugs that were carried over from the old implementation, along with some of the universally agreed-on features. I would have had them done this afternoon, but I didn't realize you were re-architecting the whole codebase.

As the naming of WorldFilterJson* and WorldFilter* pretty much violates every Java naming convention I have ever seen and is not consistent with the rest of the application, can we discuss some better names for those classes, like the original names that kept things consistent with the rest of the application, or other names if those are unacceptable?

moulins commented 8 years ago

@jmessenger235: So you define all of your filters, then you combine them elsewhere? It could work, but there are two things I don't like:


Other than that, here's my new proposition for a nested structure. I've tried to correct the flaws you pointed out, namely the verbosity and the inadequacy for certain queries.

First, I've removed the type attribute; as you said, we can always infer it from the rest:

As in the previous version, all criteria can be negated by adding "negate": true.

All criteria can be named by specifying the name attribute: a named criterion which matches will be displayed alongside the result of the filter (like your "report" logic filter). All criteria also have a weight attribute (1 by default); an optional criterion will have its weight set to 0.

Now to the compound criterion: it has a single parameter, count. This parameter indicates the minimum number of children criteria which must be matched for this criterion to match. More specifically, the sum of the weights of the matching children must be greater than or equal to count. We see that a criterion with weight set to 0 is really optional: its weight doesn't contribute to the total.

By default, the compound criterion has a count equal to the sum of the weights of its children: thus it behaves like an AND. If we want an OR, we set count to 1. By using different weights for each child, we can give more or less importance to each criterion; this should give a lot of flexibility.
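To make the count/weight semantics concrete, here is a small sketch (class and method names are placeholders, not proposed Amidst API):

import java.util.List;

public class CompoundCriterionSketch {
    interface Criterion {
        boolean matches(long seed);
        int weight(); // 0 = optional, >1 = more important
    }

    // Matches when the summed weight of the matching children reaches count.
    // count == sum of all weights behaves like AND, count == 1 behaves like OR.
    static boolean compoundMatches(List<Criterion> children, int count, long seed) {
        int total = 0;
        for (Criterion child : children) {
            if (child.matches(seed)) {
                total += child.weight();
            }
            if (total >= count) {
                return true; // early exit once the threshold is reached
            }
        }
        return total >= count;
    }
}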

I've not yet decided how a scoring system would interact with all of this, so I won't talk about it, but I do think it is a necessary feature to have.

Here's a little example to better understand all of this:

{
    "filters": [ //same semantics as a basic group criterion : the default AND group

        { //a biome/structure criterion
            "biomes"/"structures": [...],
            "radius": 1024,
            "center": [100, 100], //optional
            "square": true        //optional
        },
        { //here we have an OR group
            "children": [{...}, {...}, {...}],
            "count": 1
        },
        { //here we have a quasi-AND group:
          //we want a majority of the criteria to match, but not necessarily all of them
            "children": [
                { ... },
                { ... },
                { ... 
                    "weight": 2 //this criterion is important, so we give it a greater weight
                },
                { ... },
                { ... },
                { ...
                    "weight": 0, //this criterion is optional
                    //if it matches and the whole filter matches, "mygroup" will be printed with the seed
                    "name": "mygroup"
                }
            ],
           "count": 4 //less than the sum of the weights (which is 1+1+2+1+1+0 = 6)
        }
    ]
}

jmessenger235 commented 8 years ago

In the flat structure we don't have to define "and" for the default search group, or for any other group for that matter. All criteria in a group are treated like children of an and tag, and each group is like an or. Essentially, a search with two groups of two criteria behaves like (A & B) | (C & D): any seed matching either AB or CD will be returned and logged. Thus the default search will always be matched as "all criteria required", as will any group, making the logic filters unneeded in most cases. Currently, the group name gets logged along with the match, and a seed will be printed multiple times if it matches multiple groups (this needs to be fixed to log once and note the separate groups). The logic filters were intended to be completely optional; they are only needed for really advanced, complex cases, because the default behavior covers most use cases.

As clarification, the report filter is not used to display named groups in that example. It is a way to express "these are the groups that I care about; the others are not relevant." By default they will all be logged.

The or logic filter is a way to declare a group that matches if any of the groups listed in that or filter matches. The and is a way of requiring all of the listed groups. They were not a way to specify that the criteria within a group should be matched as an or; that would be done by merging the criteria into a single filter (i.e. village and witch hut) or by doing the quasi-and. To allow for that, we would want to add something like a minScore, like I talk about in my critique of your suggestion(s) and demonstrate in the JSON examples below.

To answer your question though, I would be inclined to say score is only relevant to the structure and biome filters. Anything else is too complicated, redundant, and unintuitive. Additionally, whether a criterion is optional should be inferred from its score in context. The examples give some detail on that.

At the bottom, I provide a JSON example of all the flat fields and how they would work, to help clarify what I am talking about, along with a nested example incorporating the improvements I would suggest.


So for starters, no nested structure addresses my two basic contentions. One, flat data will always be at least as terse as nested data, and in most cases more concise, because with groups you can reuse the same criteria in multiple groups (this is why Google uses tags in Gmail rather than directories), whereas in a nested structure those criteria must be duplicated. Two, a nested structure fails the data-by-default approach: with groupings in a flat structure, your logic is defined by your groups, so the results always carry useful information to differentiate matches, while in a nested structure, group names for a match set are additional data that has to be added on top of that.

As to some comments on your suggested nested methodology:

I think that by failing to consider scoring, you have added redundant information to the structure. Count should be treated as a minimum score, and the scoring methodology I mention below should simply replace weighting. This allows the implementation to support your count feature (which is a really good idea for either approach) without adding additional data to the mix. I think it will also be more usable, as the concept of scoring is more intuitive to most people than the concept of weighting, and much more intuitive than having both at the same time. However, in order to do this, since the base score is zero, we would need to change the children array into either an and or an or array, as in my previous JSON example and in the one below. Either way, I think we should specify the logic in the array name, as it makes what the filter is doing a bit more intuitive in the default case. There is an updated nested version below that shows how this would work.


Below are the two JSON schemas as I understand them, based on the currently discussed features and with my comments from above incorporated. Keep in mind that there are also some needed features from the doc I posted earlier that have yet to be discussed, such as a way to specify a starting seed for a sequential search, a way to search a list of seeds, and some of the optional items at the bottom of that list. On an implementation note, many of these will be easier if we implement a search pattern that starts at 0,0 and spirals outward (see the sketch after this paragraph). This will improve the performance of negative criteria in the reasonable use cases and require fewer calculations for biomes with a max range.
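
As a rough illustration of the spiral idea, here is a short Java sketch; the class and the Visitor callback are just placeholders for whatever shape the real search loop ends up taking:

//hypothetical sketch of an outward square spiral starting at (0, 0);
//visiting positions closest to the origin first lets small radii and
//negative criteria stop the search early instead of scanning a full square
public final class Spiral {
    public interface Visitor {
        //return false to stop iterating (match found, radius exceeded, ...)
        boolean visit(int x, int z);
    }

    public static void walk(int maxSteps, Visitor visitor) {
        int x = 0, z = 0;
        int dx = 1, dz = 0;      //current direction
        int segmentLength = 1;   //steps to take before turning
        int segmentPassed = 0;
        if (!visitor.visit(x, z)) {
            return;
        }
        for (int step = 0; step < maxSteps; step++) {
            x += dx;
            z += dz;
            if (!visitor.visit(x, z)) {
                return;
            }
            segmentPassed++;
            if (segmentPassed == segmentLength) {
                segmentPassed = 0;
                //turn left: (dx, dz) -> (-dz, dx)
                int oldDx = dx;
                dx = -dz;
                dz = oldDx;
                //the segment grows by one every second turn
                if (dz == 0) {
                    segmentLength++;
                }
            }
        }
    }
}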

Overall nested

{
    "filters": [ //same semantics as a basic group criterion : the default AND group

        { //a biome/structure criterion
            "biomes"/"structures": [...],
            "radius": 1024,
            //optional: this allows the search to go out to a maximum of 2048 to look for the feature
            //however the score will be scaled down if the search has to go beyond 1024
            //how it is scaled would need to be discussed
            "maxRadius": 2048,
            //because a maxRadius is present, this score does not make this an optional criterion
            "score": 5,
            //for structures this means you will need at least this many matching structures
            //for biomes this is the minimum area in quarter resolution required for a match
            "minimum": 5,
            //optional; this will also need to be specified on a
            //seed-by-seed basis for the whole filter when searching a list of seeds
            "center": [100, 100],
            "circle": true // defaults to square (false) for performance
        },
        { //here we have an OR group
            "or": [{...}, {...}, {...}],
        },
        { //here we have a quasi-AND group:
          //we want a majority of the criteria to match, but not necessarily all
            "and": [
                { ... },
                { ... },
                { ... 
                    //This would make it an optional criterion if it were not for minScore
                    "score": 2 //this criterion is important, so we give it a greater score
                },
                { ... },
                { ... },
                { ...
                    "score": 0, //this criterion is optional because of min score
                }
            ],
           //if this is present, we would need to default the score of its criteria to a non-zero score
           // one is used in this example for simplicity.
           "minScore": 4, //less than the sum of the scores (which is 1+1+2+1+1+0 = 6)

           //if it matches and the whole filter matches, "mygroup" will be printed with the seed
           //this is a name on the and group, but would also be valid on an or group or an individual
           //biome or structure object
           "name": "mygroup"
        }
    ]
}
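
On the "how it is scaled would need to be discussed" comment above: one purely illustrative option would be a linear falloff between radius and maxRadius, something like this sketch (an assumption to make the idea concrete, not a settled proposal):

final class ScoreScaling {
    //illustrative only: reduce a criterion's score linearly when the nearest match
    //lies beyond "radius" but still within "maxRadius"
    static double scaledScore(double baseScore, double distance, double radius, double maxRadius) {
        if (distance <= radius) {
            return baseScore; //full score inside the requested radius
        }
        if (distance > maxRadius) {
            return 0.0;       //no contribution at all beyond maxRadius
        }
        double falloff = (maxRadius - distance) / (maxRadius - radius);
        return baseScore * falloff; //drops from baseScore at radius down to 0 at maxRadius
    }
}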

Overall Flat (see the comments on the nested example for the values not commented here):

{
  //shortened to one array; logicFilters could be merged into here as well
  "filters": [
    {
        "distance": 2048, 
        "biomes"/"structures": [...],
        // if all filters in a group match, the seed will be logged as matching that group
        "group": ["Rare"],
        "maxRadius": 2048,
        //same notes as above including those on min score and max radius
        "score": 5,
        "minimum": 5,
        "center": [100, 100],
        "circle": true
    },
    { "group": ["Cold"] },
    { ... } //will be part of the default group
  ],
  "logicFilters": [
    //group foo is a match of all of group cold or all of group rare
    {"or": ["Cold", "Rare"], "group": "foo"},
    // group bar is a match of the default group and foo 
    //(which in this case is an or of cold and rare)
    {"and": ["Default", "foo"], "group": "bar"},
    //only print matches of group bar and rare as we don't care about default and cold
    //those are just for logic
    {"report": ["Rare", "bar"]}
    //all of the members of group rare must achieve a 
    //score of at least 3 to be considered a match for the group 
    {"minScore": "Rare" ], "val": 3}
  ]
}
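
To sketch how the logicFilters above could be resolved (names and signatures are hypothetical, not an existing implementation): the per-group results are computed first, then each logic filter derives a new named result from them, and report only restricts what gets printed:

import java.util.Arrays;
import java.util.List;
import java.util.Map;

//hypothetical sketch: "results" maps each group name to whether all of its
//criteria matched the current seed; logic filters then derive new named groups
final class LogicFilters {
    static boolean anyOf(Map<String, Boolean> results, List<String> groups) {
        return groups.stream().anyMatch(g -> results.getOrDefault(g, false));
    }

    static boolean allOf(Map<String, Boolean> results, List<String> groups) {
        return groups.stream().allMatch(g -> results.getOrDefault(g, false));
    }

    static void applyExampleFilters(Map<String, Boolean> results) {
        //{"or": ["Cold", "Rare"], "group": "foo"}
        results.put("foo", anyOf(results, Arrays.asList("Cold", "Rare")));
        //{"and": ["Default", "foo"], "group": "bar"}
        results.put("bar", allOf(results, Arrays.asList("Default", "foo")));
        //{"report": [...]} and {"minScore": ...} are not boolean combinators:
        //report only restricts which group names are printed with a matching seed,
        //and minScore would compare the summed scores of a group's criteria to "val"
    }
}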
moulins commented 8 years ago

I think that in failing to consider scoring you have added redundant information to the structure. Count should be treated as a minimum score and then just use the scoring methodology I mentioned below to replace weighting.

After some reflection, I think you're right here: having both a score and a weight makes for some unintuitive semantics.

However, I think that putting a score on a compound filter still makes sense if we agree that a compound filter doesn't aggregate the scores of its children. Meaning: if a filter doesn't have a score attribute, it has the default score of 1, regardless of what is inside it.


To resolve the endless flat/nested debate (because we both have good arguments for why our method is the best), I propose yet another format:

{
  //here we declare all the groups we need to create our search
  "groups": {
    "foo": { //a group named foo containing a single biome filter
      "biomes": [...],
      ...
    },
    "bar:" { //a group named bar containing a compound filter
      "and"/"or"/"minScore": [
        {...},
        //we can still nest compound filters if we want
        { "and": [...] },
        {...}
      ],
      ...
    },
    //for convenience, this is the same as {"and": [...]}
    "baz": [{...}, {...}, {...}],

    //a bare string is a reference to a named group: this group matches if both foo and bar match
    "quux": ["foo", "bar"]
  },

  //here we define the filter which must be satisfied for the seed to match;
  //it could be any type of filter, but in practice it will simply be a reference to a group already defined
  //for simpler queries, we can also omit the `groups` attribute and put all the logic here
  //in this case, this format more or less reduces to a simple nested format
  "match": "quux",

  //the list of groups which should be reported for additional information on a match
  //if not defined, all groups will be reported
  "report": ["foo", "baz"]
}

This format allows us to have the advantages of both flatness and nestedness: we can create complicated nested logic without naming all the intermediate steps, and we can also refer to the same group of filters in multiple places.
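
For what it's worth, resolving this format also looks straightforward; here is a minimal Java sketch (with a hypothetical Criterion type, not Amidst's actual classes) of how the bare-string group references could be handled:

import java.util.HashMap;
import java.util.Map;

//hypothetical sketch: named groups are parsed into criteria once, and a bare
//string inside a group is resolved by name, so the same group can be referenced
//from several places without being duplicated
final class GroupResolver {
    interface Criterion {
        boolean matches(long seed);
    }

    private final Map<String, Criterion> groupsByName = new HashMap<>();

    void define(String name, Criterion criterion) {
        groupsByName.put(name, criterion);
    }

    //"quux": ["foo", "bar"] resolves to an AND over the referenced groups
    Criterion allOf(String... names) {
        return seed -> {
            for (String name : names) {
                if (!groupsByName.get(name).matches(seed)) {
                    return false;
                }
            }
            return true;
        };
    }

    //"match": "quux" just evaluates the named top-level group for each seed;
    //"report" only restricts which group names are printed next to a match
    boolean matchesSeed(String matchGroup, long seed) {
        return groupsByName.get(matchGroup).matches(seed);
    }
}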