unexpectedpanda / retool

Retool: a better filter tool for Redump and No-Intro DAT files.
BSD 3-Clause "New" or "Revised" License

Incorrect Filtering for an NES English/Italian Language Set #271

Closed possiblyneal closed 1 year ago

possiblyneal commented 1 year ago

Describe the bug Many non-English and non-Italian games made it to the filtered set. At first glance it seems to only be those with (Asia) in the file name.

To reproduce

Expected behavior I expected only games in English or Italian to be present in the final set. The log shows that it kept many (Asia) titles that don't have (En).

Nintendo - Nintendo Entertainment System (Headered) (20230708-030456) (Retool 2023-07-08 13-24-38) [-Bdp] log.txt

Operating system

Retool edition

Retool version 2.00.5

Are you using custom global or system filters? If so, list them below

Global excludes: Pirate, BIOS, Demos

Global includes:

System excludes:

System includes:

What other settings are you using? Default order for English speakers, En & It languages, all other defaults.

Additional context

possiblyneal commented 1 year ago

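Also I've noticed that even though the log has:

  • Venice Beach Volleyball (USA) (NINA-06) (Unl)
  • Venice Beach Volleyball (Asia) (En) (Idea-Tek) (Unl)
  • Venice Beach Volleyball (Asia) (En) (Super Mega) (Unl)
  • Venice Beach Volleyball (USA) (Beta) (Unl)
  • Venice Beach Volleyball (USA) (MB-91) (Unl)
  • Volley Ball (Spain) (En) (Idea-Tek) (Unl)

I didn't receive any of these ROMs in the filtered set. I didn't have ROMs named Venice Beach Volleyball (USA) (MB-91) (Unl) or Venice Beach Volleyball (USA) (NINA-06) (Unl), only Venice Beach Volleyball (USA) (Unl).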

unexpectedpanda commented 1 year ago

Most of the time this is a source problem, and not something Retool can address. No-Intro isn't always accurate on languages, and that filters down to the choices Retool makes. In these cases, usually you need to file a ticket on the No-Intro page for the title in question.

After they update the titles, I can rescrape the metadata to provide the accurate information. Recently they did change a bunch of (Asia) titles from En to Zh, so when I update the metadata in the coming weeks that might address some of the problems. Having said that, I'll still take a look at the code in case something else unexpected is happening.

There's another thing to consider -- IIRC titles with unknown languages are also included in a filtered set regardless of settings (as this often includes important things like BIOS files). This is a decision I could potentially revisit, allowing people to filter out titles with unknown languages -- although it would take more than a little rewiring.

> Also I've noticed that even though the log has:
>
>   • Venice Beach Volleyball (USA) (NINA-06) (Unl)
>   • Venice Beach Volleyball (Asia) (En) (Idea-Tek) (Unl)
>   • Venice Beach Volleyball (Asia) (En) (Super Mega) (Unl)
>   • Venice Beach Volleyball (USA) (Beta) (Unl)
>   • Venice Beach Volleyball (USA) (MB-91) (Unl)
>   • Volley Ball (Spain) (En) (Idea-Tek) (Unl)
>
> I didn't receive any of these roms in the filtered set. I didn't have roms named: Venice Beach Volleyball (USA) (MB-91) (Unl) or Venice Beach Volleyball (USA) (NINA-06) (Unl) only Venice Beach Volleyball (USA) (Unl)

I'm not sure when I'll be able to look into this, but it's on my radar.

possiblyneal commented 1 year ago

So am I understanding correctly that No-Intro and Retool consider titles with (Asia) to have no language, and that they are therefore included in the set?

To confirm we are on the same page, I'm speaking of titles like the ones below, which I got with the settings described:

  • Bao Qingtian (Asia) (Unl)
  • Jingke Xin Zhuan (Asia) (Unl)
  • Jinqu KTV (Asia) (Unl)

If that is the case, maybe just an option to address titles with the (Asia) tag specifically? Most USA titles don't have any specific (En) tags, so filtering out unknown languages could get rid of good stuff too, right?

unexpectedpanda commented 1 year ago

> So am I understanding correctly that No-Intro and Retool consider titles with (Asia) to have no language, and that they are therefore included in the set?

No-Intro often doesn't properly catalog languages for titles. Sometimes contributors go back and fix this later. For a long time its standard was also to leave the language out of the filename if a title had only one language. Thankfully that's changing of late to account for things like French Canadian titles.

Redump and No-Intro also store languages for titles separately in their databases. Those languages don't always make it to filenames. This is why a scraped version of these databases is downloaded to the metadata folder when you update clone lists — so missing languages can be reassigned to titles. Because No-Intro and Redump are teams of people and I'm just one person, some time can pass before I update the metadata files that might contain the language data (no guarantee) for those new titles. In this situation, titles can slip through.

Finally, when there are no languages provided in either the filename or the metadata, Retool falls back to an implied language for each region. For (USA), English is its implied language. (Asia) also has English as its implied language, as many (Asia) titles are just bootlegs in English. Given the recent influx of Chinese titles though, I might have to reassess that decision. No-Intro should really make language tags mandatory for (Asia) titles though, as it should for (Europe) and other multi-country regions.
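The lookup order described above (filename tags, then scraped metadata, then the region's implied language) can be sketched in Python as follows. This is a minimal illustration, not Retool's actual code: the function names, the `IMPLIED_LANGUAGES` table, and the short `KNOWN_LANGUAGE_CODES` set are assumptions made for the example. Only the (USA) and (Asia) fallbacks to English come from the discussion.

```python
import re

# Hypothetical fallback table; only these two entries are stated in the thread.
IMPLIED_LANGUAGES = {"USA": ["En"], "Asia": ["En"]}

# Assumed subset of language codes, for illustration only.
KNOWN_LANGUAGE_CODES = {"En", "It", "Zh", "Fr", "Es", "De", "Ja"}


def tags(title):
    """Return the contents of every parenthesised tag in a title."""
    return re.findall(r"\(([^)]*)\)", title)


def resolve_languages(title, metadata):
    """Resolve languages: filename tag, then metadata, then region fallback."""
    # 1. A language tag in the filename, e.g. "(En)" or "(En,It)".
    for tag in tags(title):
        codes = [code.strip() for code in tag.split(",")]
        if all(code in KNOWN_LANGUAGE_CODES for code in codes):
            return codes
    # 2. Languages scraped from the No-Intro or Redump database.
    entry = metadata.get(title)
    if entry and entry.get("languages"):
        return list(entry["languages"])
    # 3. The implied language of the first region that has one.
    for tag in tags(title):
        for region in (part.strip() for part in tag.split(",")):
            if region in IMPLIED_LANGUAGES:
                return list(IMPLIED_LANGUAGES[region])
    # 4. Nothing matched: the language is truly unknown.
    return []
```

Under these assumptions, a title such as Bao Qingtian (Asia) (Unl) with no metadata entry resolves to English through the (Asia) fallback, which is how it can slip through an English filter. Once the metadata lists Zh for it, step 2 wins and the title resolves to Chinese.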

>   • Bao Qingtian (Asia) (Unl)
>   • Jingke Xin Zhuan (Asia) (Unl)
>   • Jinqu KTV (Asia) (Unl)

I need to check what is happening with these specific titles. It could just be the case that they're recent, and I haven't updated the metadata for a while so their languages aren't stored. Things are also complicated by No-Intro making some titles "private", so the data might simply not be available.

> If that is the case maybe just an option to address titles with the (Asia) tag specifically?

You could remove Asia from the regions list if you want them gone. The only other step I could see from here is to somehow detect language from the filename itself. But that's assuming there isn't an option after loading the title to switch to English...

> filtering out unknown languages could get rid of good stuff too, right?

Exactly. That's why if a title's language is truly unknown, it's included regardless of filters. This generally only happens with regionless titles though, or countries that don't have implied languages (usually they don't have an implied language because the language distribution for those regions is too tight to make assumptions).
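As a rough sketch of that rule, assuming a title's languages have already been resolved to a list (empty when truly unknown), the filter decision might look like the following. The function name `keep_title` is hypothetical and this is not Retool's implementation.

```python
def keep_title(title_languages, wanted_languages):
    """Keep a title if its language is unknown or overlaps the wanted set."""
    if not title_languages:
        # Unknown language: included regardless of the language filter,
        # so that things like BIOS files are not dropped by accident.
        return True
    return any(lang in wanted_languages for lang in title_languages)
```

With an English and Italian filter, `keep_title([], {"En", "It"})` keeps the title, while `keep_title(["Zh"], {"En", "It"})` drops it.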

unexpectedpanda commented 1 year ago

"Bao Qingtian (Asia) (Unl)": {
    "languages": ["Zh"]
},
"Jingke Xin Zhuan (Asia) (Unl)": {
    "languages": ["Zh"]
},
"Jinqu KTV (Asia) (Unl)": {
    "languages": ["Zh"]
},

All three of those titles were previously not included in the metadata. After a fresh scrape they now are, and are assigned as Chinese (as are many other new titles). I can confirm they are removed if you set Retool to only include English and Italian languages. If you update your clone lists, the updated metadata will also be downloaded and you should see the same.

So the problem here was I simply took too long to update the metadata and things got out of sync, although the root problem is No-Intro not tagging its titles properly with languages.

> Also I've noticed that even though the log has:
>
>   • Venice Beach Volleyball (USA) (NINA-06) (Unl)
>   • Venice Beach Volleyball (Asia) (En) (Idea-Tek) (Unl)
>   • Venice Beach Volleyball (Asia) (En) (Super Mega) (Unl)
>   • Venice Beach Volleyball (USA) (Beta) (Unl)
>   • Venice Beach Volleyball (USA) (MB-91) (Unl)
>   • Volley Ball (Spain) (En) (Idea-Tek) (Unl)

Here's what my log says:

+ Venice Beach Volleyball (USA) (NINA-06) (Unl)
  - Venice Beach Volleyball (Asia) (En) (Idea-Tek) (Unl)
  - Venice Beach Volleyball (Asia) (En) (Super Mega) (Unl)
  - Venice Beach Volleyball (USA) (Beta) (Unl)
  - Venice Beach Volleyball (USA) (MB-91) (Unl)
  - Volley Ball (Spain) (En) (Idea-Tek) (Unl)

This means Venice Beach Volleyball (USA) (NINA-06) (Unl) should make it through to the output DAT, and everything else should be removed, which is what happens:

<game name="Venice Beach Volleyball (USA) (NINA-06) (Unl)">
    <category>Games</category>
    <description>Venice Beach Volleyball (USA) (NINA-06) (Unl)</description>
    <rom name="Venice Beach Volleyball (USA) (NINA-06) (Unl).nes" size="65552" header="4e 45 53 1a 02 04 f1 48 00 00 00 00 00 00 00 01" crc="aa48157c" md5="a1ae3f7bd9fce44165108b5b63a19aa6" sha1="e0d1c129f47d17d5aa4dc4b63c1d8c902bc3c73d" sha256="e36c10895d0bb6e9e04d711a03bc6852a205d0993e930f8a956f50936ac2a675"/>
</game>

> I didn't receive any of these roms in the filtered set.

By filtered set do you mean output DAT? Or something else?

> I didn't have roms named: Venice Beach Volleyball (USA) (MB-91) (Unl) or Venice Beach Volleyball (USA) (NINA-06) (Unl) only Venice Beach Volleyball (USA) (Unl)

So here we're talking ROM files in this instance? Note that Venice Beach Volleyball (USA) (Unl) isn't in the original No-Intro DAT. So if you've got a file by that name already, you'll need to load Retool's output DAT in a ROM manager like CLRMAMEPro or RomVault and then point it at the folder containing that file to handle renaming, assuming it's a match for a hash in the DAT.
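To illustrate what "a match for a hash in the DAT" means, here is a minimal Python sketch that indexes a DAT by SHA-1 and looks up the name the DAT expects for a given file's contents. It is only an illustration of the idea, not how CLRMAMEPro or RomVault work internally. The function names are hypothetical, and the sketch hashes the whole file as-is without any special handling of headers.

```python
import hashlib
import xml.etree.ElementTree as ET


def sha1_index(dat_xml):
    """Map each ROM's SHA-1 in a DAT to the file name the DAT expects."""
    root = ET.fromstring(dat_xml)
    return {
        rom.get("sha1").lower(): rom.get("name")
        for rom in root.iter("rom")
        if rom.get("sha1")
    }


def expected_name(file_bytes, index):
    """Return the DAT's name for these bytes, or None if no hash matches."""
    digest = hashlib.sha1(file_bytes).hexdigest()
    return index.get(digest)
```

In use, you would read the output DAT and the ROM file from disk, build the index once, and call `expected_name` per file. A result of `None` means the file's hash is not in the DAT, so a ROM manager would have nothing to rename it to.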

possiblyneal commented 1 year ago

Thank you for the clarification. Based on that information, I think you've made the right call in how Retool handles (Asia) and other titles considered to have an unknown language. I wouldn't change anything. If I had known this was the case, I would have reviewed the titles and submitted tickets to No-Intro instead. I'll do that in the future.

> The only other step I could see from here is to somehow detect language from the filename itself. But that's assuming there isn't an option after loading the title to switch to English...

That, and the fact that many titles consist of proper nouns, means most language-detection implementations would yield many false positives or false negatives.

> By filtered set do you mean output DAT? Or something else?

I was referring to the rebuilt rom set as a result of running the Retool DAT in CMP.

> So here we're talking ROM files in this instance? Note that Venice Beach Volleyball (USA) (Unl) isn't in the original No-Intro DAT

I was working with a ROM set from mid-March 2023. CMP shows that Venice Beach Volleyball (USA) (Unl) has been renamed to Venice Beach Volleyball (USA) (MB-91) (Unl). Again, this may be a misunderstanding of how Retool's logic works, but I expected that since Retool didn't find the parent ROM, Venice Beach Volleyball (USA) (NINA-06) (Unl), it would have chosen the next best option in the clone list, which would have been Venice Beach Volleyball (USA) (MB-91) (Unl). To be clear, I don't know that this necessarily needs to be a feature; I'm just explaining where I was coming from here.

Thanks again for your help and detailed explanations.

unexpectedpanda commented 1 year ago

Ah no, Retool has no awareness of your current ROM files; it only operates on DAT files. There's also no way to store fallback ROM information in a DAT for CMP to figure out; CMP just takes the data it's given.

ROM fallback based on your current files is a feature that 1G1R ROM set generator supports, as it has a file management focus instead of a DAT focus. File management isn't in the plan for Retool though.
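For illustration only, the kind of file-aware fallback described above could be sketched like this in Python. The function name and the idea of passing a priority-ordered candidate list are assumptions for the example. This is not how 1G1R ROM set generator is implemented, and Retool does nothing like it.

```python
def pick_available(candidates, available):
    """Return the first candidate, in priority order, that exists on disk."""
    for name in candidates:
        if name in available:
            return name
    # None of the candidates are present in the user's collection.
    return None
```

Given the Venice Beach Volleyball group in priority order and a collection that only contains the (MB-91) file, this returns the (MB-91) title rather than the missing (NINA-06) parent.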

I'll close this up now as it seems to be solved -- I'll get to the other issues you've posted soon :)