marklio / LinqToStdf

A library for parsing/processing Standard Test Datalog Format (STDF) files, typically used in semiconductor testing.

Found a corrupt STDF, and I'm told these might come in regularly... #30

Open mgoldste1 opened 11 months ago

mgoldste1 commented 11 months ago

So I'm in the validation phase of my converter, and I found one STDF file that my parser can't handle. I've spent a large portion of my week looking into this file, trying to understand why LinqToStdf can't parse it correctly when two other parsers I'm using can. One of those parsers is QuickEdit. The other was written by someone in my company, so I had a chat with him about this.

Basically what happens is, some of these STDFs are coming from questionable sources and can't be fully relied on. There is no way around this. This STDF file has an insane number of DTRs, which is normal for this product. Something happened while this file was being written, and in the middle of a DTR, 4096 zero bytes were written to disk. I haven't figured out yet whether the DTR was interrupted and its data continued after the run of zeros ended, or whether that cluster on disk was corrupt and the DTR would have ended somewhere in the run of zeros. I attempted to remove the 4096 zeros using EmEditor's binary mode, but it didn't help. Either I did that wrong or the first theory is wrong. Tomorrow I'm going to try deleting most of the zeros but leaving enough so the correct number of bytes are there for the string.

This is what the DTR's text field looks like when I export it to JSON - "Text": "[103 characters redacted]\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"

I believe each \u0000 equates to a single zero byte, which makes the field 245 characters long and within the 255-character limit.

The guy who wrote the parser I'm replacing said that his attempts to read a record, and if it can't parse it correctly, it goes back to the end of the previous record, advances 1 byte, then tries again. If it fails again, it goes back again and advances 2 bytes... It continues until it actually gets something it can make sense of. That seems like a pretty serious change for this package, but I'm not sure I see another way around this.

Since the length is first in the header, maybe it's possible to have it check whether the length is zero, and if so assume corruption? Like, can a record actually have a length of zero? If not, then this specific issue might be easier to fix than I thought, but this assumes that when the data stops being zeros, it's actually the start of a real record and not the middle of a different record. Also, since the length is two bytes, the first time it gets a nonzero value I suppose it'd have a 50% chance of being right, because you don't know whether the first byte of the U*2 was actually a zero or not. And this would potentially only fix this specific issue. Having it do what I described above would be way better.
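Just to make sure I'm describing that resync idea clearly, here is a rough sketch of it (my own C#, not his actual code and not anything in LinqToStdf; the looksValid check and the little-endian assumption are mine):

using System;
using System.IO;

static class HeaderScan
{
    // Walk forward one byte at a time from the end of the last good record until a
    // 4-byte window looks like a plausible STDF record header. looksValid is a
    // hypothetical sanity check, e.g. "is this a known REC_TYP/REC_SUB pair with a sane REC_LEN".
    // Real STDF endianness depends on the FAR's CPU_TYPE; little-endian is assumed here.
    public static long FindNextPlausibleHeader(Stream stdf, long lastGoodRecordEnd,
                                               Func<ushort, byte, byte, bool> looksValid)
    {
        var header = new byte[4];
        for (long offset = lastGoodRecordEnd; offset <= stdf.Length - 4; offset++)
        {
            stdf.Position = offset;
            if (stdf.Read(header, 0, 4) < 4) break;

            ushort recLen = (ushort)(header[0] | (header[1] << 8)); // REC_LEN (U*2)
            byte recTyp = header[2];                                // REC_TYP (U*1)
            byte recSub = header[3];                                // REC_SUB (U*1)

            // Treat a zero REC_LEN as suspicious (see the question above about whether
            // zero is even legal) and skip anything the caller doesn't recognize.
            if (recLen > 0 && looksValid(recLen, recTyp, recSub))
                return offset; // plausible record start; resume parsing from here
        }
        return -1; // no plausible header found before the end of the file
    }
}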

A little more about this file and what I've done...

When LinqToStdf gets to this bad section, it reads in the DTR, reads in enough zeros to finish where it thinks that record should end, and then the next 988 records after that are read in as unknown records of type 0, subtype 0, and length 0. After it runs out of zeros, I believe it assumes it's starting at a record header. It sees a type/subtype it doesn't understand, reads in the number of bytes the record length field tells it to, then moves on under the assumption that the next byte is the start of the next record. The only reason I noticed was that one of these random type/subtypes it read in matched an SBR record type. It tried to parse what it thought was an SBR and failed because the C*n at the end of the record claimed to be longer than the number of bytes left in that record. This is what the order of the records looks like in the file, formatted as [record short name]([record type]-[record subtype])([quantity of this type of record in a row]):

DTR(50-30)(6528)
UNKNOWN(0-0)(988)
?UNKNOWN(85-90)(1)
?UNKNOWN(110-52)(1)
?UNKNOWN(107-53)(1)
?UNKNOWN(103-68)(1)
?UNKNOWN(57-89)(1)
?UNKNOWN(122-120)(1)
?UNKNOWN(56-89)(1)
?UNKNOWN(120-109)(1)
?UNKNOWN(76-84)(1)
?UNKNOWN(107-114)(1)
?UNKNOWN(75-70)(1)
?UNKNOWN(74-111)(1)
?UNKNOWN(61-32)(1)
?UNKNOWN(73-72)(1)
?UNKNOWN(104-116)(1)
?UNKNOWN(53-122)(1)
?UNKNOWN(86-104)(1)
?UNKNOWN(101-109)(1)
?UNKNOWN(108-97)(1)
?UNKNOWN(116-73)(1)
SBR(1-50)(1)
?UNKNOWN(83-83)(1)
?UNKNOWN(97-99)(1)
?UNKNOWN(79-73)(1)
?UNKNOWN(102-120)(1)
?UNKNOWN(106-85)(1)
?UNKNOWN(47-118)(1)

When I disable SBR parsing, SBRs are treated as unknown records, so it doesn't error out at that record. What happens next is it seems to randomly land on the start of a DTR, parses it correctly, and continues on fine. With SBR parsing disabled, it seemed like my converter successfully converted the file, but it actually left out a die because a PRR/PIR was missed. When I inspect the STDF file with EmEditor's binary mode, these unknown records contain DTRs, PTRs, a PIR, and a PRR. When I stick this file into QuickEdit, write it out again, then input that file into my converter, it finds the correct number of die.

So I have a manual workaround for the issue, but I can't say how often the STDFs for this product are going to be like this. I had access to the most recent 40-ish STDFs and they were all fine, so that is promising, but if it happens once, especially during a validation phase, you can put money on it happening again.

dingetje commented 11 months ago

This STDF is clearly corrupt, and that is of course fatal for a format where the next record is found based on the length in the header of the last valid record.

I've been working with STDF files for most of my career, and the only thing you can do with such corrupt files is attempt to repair them with a hex editor. I don't know the EmEditor you mentioned; I use HxD, a freeware hex editor.

marklio commented 11 months ago

@mgoldste1, your analysis seems correct. Back when I was in the industry, I would often find files with corrupted bytes in the middle that would throw off parsers until they landed on a record start due to "random chance". One of the reasons LinqToStdf produces UnknownRecords is so you can start to detect this corruption rather than ignoring it. LinqToStdf has an additional feature called "rewind and seek". If you enable it, it can rewind the stream when certain corrupt patterns are detected and "search" for plausible record headers to get "back on track" faster than by random chance. This can reduce the number of missing records. @dingetje is correct that careful manual repair of the file is the "best" you can do, but rewind and seek can give you a good balance of extracting as much data as possible while a) knowing that corruption exists (a CorruptRecord is emitted for the bytes that don't correspond to a readable record) and b) having control over the corruption detection and recovery.

Since I'm not in the industry any longer, testing this feature on "real" corrupted STDFs isn't something I've ever been able to do, which is why I was interested in corrupt stdf files.

mgoldste1 commented 11 months ago

If I can figure out how to manually repair it, I should be able to make you a corrupt one without any company IP in it (if you want).

How does rewind and seek work? I tried calling the RewindAndSeek() method right after creating the StdfFile object and that didn't help. I also tried creating the StdfFile object and putting an stdfFile.Consume() in a try/catch (also tried stdfFile.GetRecords().ToList()), then calling RewindAndSeek/Consume within the catch block, but it's still giving the same error. That was with the no-cache strategy. With the V4 one it acts a little differently.

NoCache strat error -

System.IO.EndOfStreamException
  HResult=0x80070026
  Message=Expected 47 more bytes while trying to read 84.
  Source=LinqToStdf
  StackTrace:
   at LinqToStdf.BinaryReader.FillBuffer(Int32 length) in C:\stdf\code\STAN\LinqToStdf\BinaryReader.cs:line 664
   at LinqToStdf.BinaryReader.ReadToBuffer(Int32 length, Boolean endianize) in C:\stdf\code\STAN\LinqToStdf\BinaryReader.cs:line 640
   at LinqToStdf.BinaryReader.ReadString(Int32 length) in C:\stdf\code\STAN\LinqToStdf\BinaryReader.cs:line 435
   at LinqToStdf.BinaryReader.ReadString() in C:\stdf\code\STAN\LinqToStdf\BinaryReader.cs:line 447
   at LinqToStdf.RecordConverterFactory.<>c__DisplayClass21_0.b__0(UnknownRecord ur) in C:\stdf\code\STAN\LinqToStdf\RecordConverterFactory.cs:line 212
   at LinqToStdf.RecordConverterFactory.Convert(UnknownRecord unknownRecord) in C:\stdf\code\STAN\LinqToStdf\RecordConverterFactory.cs:line 181
   at LinqToStdf.StdfFile.d__56.MoveNext() in C:\stdf\code\STAN\LinqToStdf\StdfFile.cs:line 625
   ...

With V4, it generates that exception above, then enters the try catch where it enables rewind and seek, then calls consume again. That second consume generates the following exception -

System.InvalidOperationException
  HResult=0x80131509
  Message=Nested iterations cannot be triggered inside RecordFilter implementations when caching is enabled.
  Source=LinqToStdf
  StackTrace:
   at LinqToStdf.Indexing.CachingIndexingStrategy.LinqToStdf.Indexing.IIndexingStrategy.CacheRecords(IEnumerable`1 records) in C:\stdf\code\STAN\LinqToStdf\Indexing\IIndexingStrategy.cs:line 45
   at LinqToStdf.StdfFile.<GetRecordsEnumerable>d__48.MoveNext() in C:\stdf\code\STAN\LinqToStdf\StdfFile.cs:line 421
   at System.Linq.Enumerable.Count[TSource](IEnumerable`1 source) in /_/src/libraries/System.Linq/src/System/Linq/Count.cs:line 38
   at System.Linq.EnumerableExecutor`1.Execute() in /_/src/libraries/System.Linq.Queryable/src/System/Linq/EnumerableExecutor.cs:line 47
   at System.Linq.EnumerableQuery`1.System.Linq.IQueryProvider.Execute[TElement](Expression expression) in /_/src/libraries/System.Linq.Queryable/src/System/Linq/EnumerableQuery.cs:line 97
   at LinqToStdf.StdfFile.Queryable`1.Execute[TResult](Expression expression) in C:\stdf\code\STAN\LinqToStdf\StdfFile.cs:line 387
   at LinqToStdf.StdfFile.Consume() in C:\stdf\code\STAN\LinqToStdf\StdfFile.cs:line 443
   at Program.<Main>$(String[] args) in C:\stdf\code\STAN\RandomTests\Program.cs:line 31

Note: the line numbers in the exceptions probably won't line up for some of the files.

marklio commented 11 months ago

When you say "it enters the try catch where it enables rewind and seek.", is that a try/catch you've created? Or something else?

mgoldste1 commented 11 months ago

I put a breakpoint in the InternalGetAllRecords's _InSeekMode area and it never triggers.

I tried this -

StdfFile sf = new(@"bane of my existence.stdf", new SimpleIndexingStrategy()); // tried v4 and noncache too
try
{
    sf.GetRecords(); // also tried sf.GetRecords().ToList() and sf.Consume()
}
catch (EndOfStreamException ex)
{
    sf.RewindAndSeek();
    sf.AddSeekAlgorithm(SeekAlgorithms.LookForPirs);
    sf.GetRecords().ToList(); // also tried sf.Consume()
}

And this -

StdfFile sf = new(@"bane of my existence.stdf", new SimpleIndexingStrategy()); // tried v4 and noncache too
sf.GetRecords();
sf.RewindAndSeek();
sf.AddSeekAlgorithm(SeekAlgorithms.LookForPirs);
sf.GetRecords();

And this -

StdfFile sf = new(@"bane of my existence.stdf", new SimpleIndexingStrategy()); // tried v4 and noncache too
sf.RewindAndSeek();
sf.AddSeekAlgorithm(SeekAlgorithms.LookForPirs);
sf.GetRecords();

And this -

StdfFile sf = new(@"bane of my existence.stdf", new V4StructureIndexingStrategy());
sf.AddSeekAlgorithm(SeekAlgorithms.LookForPirs);
sf.GetRecords();
sf.RewindAndSeek();
var orderedbyname = sf.GetRecords()
    .GroupBy(o => o.GetRecordType_Safe())
    .Select(o => new { o.Key, cnt = o.Count() })
    .OrderBy(o => o.Key)
    .ToList();

As you can see, I have no idea what I am doing :)

[screenshot of the record dump] A lot more 0-0 records, then it goes to this - [screenshot]

At that point it hits the SBR that it can't parse because it isn't actually an SBR.

marklio commented 11 months ago

Yeah, RewindAndSeek has to be done as part of the "filter stack". Otherwise, you'll hit the enumeration re-entrancy check. I think the only built-in filter that triggers RewindAndSeek is the "ExpectOnlyKnownRecords" filter, which will trigger RewindAndSeek if any UnknownRecords are encountered. You can add this filter, and it should do what you want. RewindAndSeek should probably make sure that things are in a context in which it will work, and give you a diagnostic message. Obviously, if I was writing this all over again, I'd do it pretty differently :)
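In code, the shape I'm describing is roughly this (a sketch from memory, untested; the file name is a placeholder):

StdfFile stdf = new(@"corrupt.stdf");
stdf.AddFilter(BuiltInFilters.ExpectOnlyKnownRecords); // part of the filter stack, so it can trigger RewindAndSeek
var records = stdf.GetRecords().ToList();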

marklio commented 11 months ago

I also don't remember if you need to specify a seek algorithm. There are built-in ones, but by default you might just get the "do nothing" one. Unfortunately, I don't have time to go digging around today. :(

mgoldste1 commented 11 months ago

The built-in seek algorithm says it does nothing, so I was adding in the example PIR one too. I should be able to get away with losing data for one die, but losing the entire wafer's data isn't something they'd accept. Depending on how frequently this happens, telling them to put the STDF through QuickEdit might be an acceptable solution, but it's hard to say how often this is going to happen... If it's frequent, that isn't acceptable. I'm scared that we'll find out it's too frequent a few months after my converter takes over there.

I'm trying to get this rewind and seek thing working, but I'm still failing at it. You're right that you need to feed it a filter for it to trigger the code, but the problem is that with the ExpectOnlyKnownRecords filter, it now throws an exception when it finds the corruption. The rewind and seek PIR algorithm claims to find the next PIR, but it doesn't resume parsing the file.

The other problem is that my code pairs up PIRs with PRRs using the PIR.GetPRR() method, which for a corrupt file would return the wrong PRR because the correct PRR was skipped. I might be able to have it detect whether it finds another PIR with the same head/site as the current one before a PRR (sketched below), but getting that to work with the V4 index strategy might be problematic. I don't think you can make a rewind and seek that finds a PRR, because the record length is not static. The only way I think you could do that is to first find the PIR, then continue going in reverse while trying to find a PRR that matches the rec type/subtype/number of bytes you've rewound. The problem there is that if it's a multi-DUT file you'd still potentially miss a PRR... It'd need to know how many PIRs came after the previous PRR so it knows how many to find going in reverse... It seems like a massive headache.

It almost seems easier to add code that advances one byte at a time every time it finds an unknown record, until it finds a correct rec type/subtype and can correctly parse it. Honestly, I'm not sure I'd be capable of doing that... but I think that would be the right approach to cover the most use cases possible.
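Here's the shape of the duplicate-PIR check I have in mind (a sketch only; I'm assuming Pir/Prr expose HeadNumber/SiteNumber properties, so the exact names/types may need adjusting):

using System;
using System.Collections.Generic;
using LinqToStdf;
using LinqToStdf.Records.V4;

static class MissingPrrCheck
{
    // Walk the record stream and flag any die whose PRR looks like it got swallowed:
    // either a second PIR shows up for the same head/site before a PRR, or a PIR
    // never gets a PRR at all.
    public static void ReportSuspectDies(IEnumerable<StdfRecord> records)
    {
        var open = new Dictionary<(byte? head, byte? site), Pir>(); // PIRs still waiting on a PRR
        foreach (var record in records)
        {
            if (record is Pir pir)
            {
                (byte? head, byte? site) key = (pir.HeadNumber, pir.SiteNumber);
                if (open.ContainsKey(key))
                    Console.WriteLine($"Head {key.head}/site {key.site}: second PIR before a PRR - the previous die's PRR is probably in the corrupt region.");
                open[key] = pir;
            }
            else if (record is Prr prr)
            {
                open.Remove((prr.HeadNumber, prr.SiteNumber));
            }
        }
        foreach (var key in open.Keys)
            Console.WriteLine($"Head {key.head}/site {key.site}: PIR never got a matching PRR.");
    }
}

The idea would be to run that over the GetRecords() output and see whether the missing die shows up.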

As for getting the rewinding and resuming to work... this is the code I was using.


StdfFile sf = new(@"filename.stdf", new V4StructureIndexingStrategy()); (tried non cache and simple too)
sf.AddFilter(BuiltInFilters.ExpectOnlyKnownRecords);
sf.AddSeekAlgorithm(SeekAlgorithms.LookForPirs);
var Pirs = sf.GetRecords().OfExactType<Pir>().Count();

That spits out the following:

LinqToStdf.StdfFormatException
  HResult=0x80131500
  Message=5487476 bytes of corrupt data found at offset 159051918.
  Source=LinqToStdf
  StackTrace:
   at LinqToStdf.BuiltInFilters.d__20.MoveNext() in C:\stdf\code\STAN\LinqToStdf\BuiltInFilters.cs:line 271
   at LinqToStdf.BuiltInFilters.d__34.MoveNext() in C:\stdf\code\STAN\LinqToStdf\BuiltInFilters.cs:line 426
   at LinqToStdf.Indexing.V4StructureIndexingStrategy.IndexRecords(IEnumerable`1 records) in C:\stdf\code\STAN\LinqToStdf\Indexing\V4StructureIndexingStrategy.cs:line 149
   at LinqToStdf.Indexing.CachingIndexingStrategy.LinqToStdf.Indexing.IIndexingStrategy.CacheRecords(IEnumerable`1 records)

The offset is correct, and the number of bytes seems right... but even if I get that working I still have the problem of getting the PRRs before that PIR.

Honestly any help is greatly appreciated here. I've learned how various parts of your code work and expanded on it, but that record iterator is some next level shit... lol

marklio commented 11 months ago

Corrupt STDFs are really a test vendor hardware/software issue or a local infrastructure problem (bad file copies, etc.). IMO, there's not really a good excuse for that kind of problem in 2023. LinqToStdf tries to enable maximal recovery of data, but data can definitely be corrupted to a degree that makes recovery difficult, and there are likely corruption patterns that make recovery impractical.

If I had the actual file, I could likely determine if this is the best the library could do (it seems unlikely that there is actually 5MB of corruption), and make a proposal for how to best handle the kind of corruption that's present. I'm happy to sign an NDA.

However, I would be focused on investigating and determining the sources of corruption, and on using LinqToStdf to get as much data as possible; it can't repair STDFs reliably, and it certainly can't be relied upon to do so.

mgoldste1 commented 11 months ago

I'm 100% with you on the fact that this shouldn't be my problem. From the customer's standpoint, the old converter parsed it, therefore the new one should too. We have a meeting on Monday with the guy, and I had a prep meeting today with the folks on my side so we are all on the same page before going into it. We all agree it shouldn't be our problem, but if, god forbid, we start getting a swarm of these months down the line, we're setting ourselves up for failure.

I brought up the fact that if I could send you the file it'd make things way easier. It's one from a super critical client, though, and they didn't even humor the idea. We'd be breaking our NDA with them by sending it to someone else. My best bet is to try to replicate the issue in an STDF that doesn't contain any IP. I can definitely make one that replicates this by simply dumping 4 KB worth of zeros into the middle of a DTR, but I don't know if I'd be able to replicate the whole randomly-finding-a-record-header-for-an-SBR-and-trying-to-parse-it part.
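If it helps, generating that repro should only take a few lines, something like this (the file names and splice offset are made up; the offset needs to land inside a DTR's text field in the donor file):

using System.IO;

class MakeCorruptRepro
{
    static void Main()
    {
        const string clean = "clean.stdf";      // hypothetical donor file with no IP in it
        const string corrupt = "corrupt.stdf";  // hypothetical output
        const int spliceOffset = 0x100000;      // made-up offset; pick one inside a DTR's text field
        byte[] bytes = File.ReadAllBytes(clean);

        using var output = File.Create(corrupt);
        output.Write(bytes, 0, spliceOffset);                           // everything before the splice point
        output.Write(new byte[4096], 0, 4096);                          // the 4 KB run of zero bytes
        output.Write(bytes, spliceOffset, bytes.Length - spliceOffset); // the rest, shifted down
    }
}

Whether that exactly reproduces the original corruption depends on whether the zeros were inserted or overwrote data, which I still haven't figured out.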

The corruption is definitely not 5 MB long. It's 4 KB. It's coming up as 5 MB because rewind and seek looks for the next PIR and then declares everything between the start of the corruption and that PIR as corrupt. That is plausible.

marklio commented 11 months ago

Cool. I'm happy to sign their NDA as well. Alternatively, you could hire me as a consultant. :) I'm willing to help and open to creative options.

In the meantime, it would be pretty easy to write your own seek algorithm that looked for other records. 5MB is a huge amount of data. I guess maybe this is massively parallel waferprobe (lots of parts), or just an enormous test program :)

mgoldste1 commented 11 months ago

Honestly, the 5 MB isn't as much as you'd think. Most of the data in these STDFs is in DTRs, which contain encrypted data we don't have access to. There are only about 50 PTRs per die and then 30k DTRs full of garbage. We are losing 1 PRR/PIR and somewhere between 1 and 100 PTRs.

Just before you posted this comment I sent my boss an IM saying "what if we could pay this guy to help us". He hasn't responded yet so I don't have an answer. My gut says they will tell the customer on Monday that corrupt stdfs are their problem, then 6 months down the road when we get a string of these, that's when they'd approve it... but this customer is persuasive so it could go either way.

If the answer comes back as a possibility, I think the best solution is to try parsing a record, and if it ends up being an unknown record or the conversion throws an exception (like the SBR case in this file), rewind to 1 byte after the start of the previous known good header and try again, looping until it successfully parses something, then continue as normal. It definitely isn't perfect, but it'd greatly reduce the chances of this happening. Losing some speed shouldn't be an issue; my converter is at least 60x faster than the old one and I could easily double that, but I was told not to.

I'll let you know when we get an answer to the consult question.

marklio commented 11 months ago

Ah, super interesting. That framing makes more sense. The encryption piece is fascinating.

mgoldste1 commented 11 months ago

My manager wasn't against this idea, but he also wasn't sure how to handle it. We've gotten Microsoft/vendors to help with certain things in the past, but this is a bit different. He's looking into whether it's allowed.