sepinf-inc / IPED

IPED Digital Forensic Tool. It is an open source software that can be used to process and analyze digital evidence, often seized at crime scenes by law enforcement or in a corporate investigation by private examiners.
Other
924 stars 217 forks source link

Open Financial Exchange (OFX) Parser #1908

Closed gfd2020 closed 7 months ago

gfd2020 commented 11 months ago

Implementation of OFX file parser. At first it only implements Signon, Banking, Credit Card messages (there are others: ,Payments, Investments, etc). The files are transformed into Excel spreadsheets.

Test cases:

https://github.com/ofx-reader/ofx-reader/tree/main/ofx-files

PS ( Edit): OFC ( Microsoft Money ) now supported.

Issue reference #1905

lfcnassif commented 11 months ago

Thank you @gfd2020 for this contribution!

patrickdalla commented 10 months ago

Hi.

I have downloaded this samples from https://github.com/csingley/ofxtools/tree/master/tests/data

I will test with them and register the results soon.

patrickdalla commented 10 months ago

For the file invstmtrs.ofx only sign on information was parsed.

The INVSTMTMSGSRSV1 and SECLISTMSGSRSV1 OFX tags weren't parsed. It seems they contain information of investment operations on stock market, like orders, position, etc. Maybe it wasn't the scope of this parser.

patrickdalla commented 10 months ago

For profrs.ofx also only signon info was parsed. It seems it does no contain financial transactions info at all. Maybe it wasn't the scope of this parser.

patrickdalla commented 10 months ago

For stmtrs.ofx and stmtrs_euro.ofx all the 2 transactions in each file were correctly parsed. It seems that all meaningful information was inserted in the resulting XLS. It seems the unique difference between the files is the signal used as decimal separator . for the first and , for the last.

patrickdalla commented 10 months ago

I found some more samples at: https://github.com/ofx-reader/ofx-reader/tree/main/ofx-files. I will test them.

patrickdalla commented 10 months ago

For amex.ofx, only singon info were parsed. It seems it does not parses CREDITCARDMSGSRSV1 tags.

patrickdalla commented 10 months ago

For the files BancodoBrasil.ofc and BancodoBrasil.ofx, it seems all transactions informations were correctly parsed and converted inserted in the resulting XLS.

patrickdalla commented 10 months ago

Correction: Correct charset doesn't seem to be used in the final XLS. image

patrickdalla commented 10 months ago

@gfd2020 All other files have the same pattern of parse results. BANKMSGSRSV1 internal information were correctly parsed. It seems also important to parse CREDITCARDMSGSRSV1 tags, at least, as it contains credit card transactions, but also INVSTMTMSGSRSV1 seems important.

Also charset must be set accordingly.

gfd2020 commented 10 months ago

@gfd2020 All other files have the same pattern of parse results. BANKMSGSRSV1 internal information were correctly parsed. It seems also important to parse CREDITCARDMSGSRSV1 tags, at least, as it contains credit card transactions, but also INVSTMTMSGSRSV1 seems important.

Also charset must be set accordingly.

Hi @patrickdalla . I agree that the other statments are important. I will try to implement them. I will also look at encoding errors. Thanks.

gfd2020 commented 9 months ago

Hi @patrickdalla.

I managed to implement credit card transactions. Investments are very large and complex. I think it's best to leave these for future work.

PS: I was unable to reproduce the character error you had. Here, for me, in IPED and Excel the characters appear correct.

image

patrickdalla commented 7 months ago

Hi, @gfd2020. I solved the charset problem in some of the files opening the InputStreamReader as follows: Reader reader = new InputStreamReader(inputStream, "windows-1252");

But I think it won't solve every distinct case.

patrickdalla commented 7 months ago

I saw that this info is in the beggining of OFX file, but the AggregateUnmarshaller object ignores this info:

OFXHEADER:100
DATA:OFXSGML
VERSION:102
SECURITY:NONE
ENCODING:USASCII
CHARSET:1252
COMPRESSION:NONE
OLDFILEUID:NONE
NEWFILEUID:NONE

<OFX>
   <SIGNONMSGSRSV1>
      <SONRS>
         <STATUS>
            <CODE>0</CODE>
            <SEVERITY>INFO</SEVERITY>
         </STATUS>
         <DTSERVER>20160626120000[-3:BRT]</DTSERVER>
         <LANGUAGE>POR</LANGUAGE>
         <FI>
            <ORG>Banco do Brasil</ORG>
            <FID>1</FID>
         </FI>
      </SONRS>
   </SIGNONMSGSRSV1>
   <BANKMSGSRSV1>
      <STMTTRNRS>
         <TRNUID>1</TRNUID>
         <STATUS>
            <CODE>0</CODE>
            <SEVERITY>INFO</SEVERITY>
         </STATUS>
         <STMTRS>
            <CURDEF>BRL</CURDEF>
            <BANKACCTFROM>
               <BANKID>1</BANKID>
               <BRANCHID>1234-1</BRANCHID>
               <ACCTID>54321-9</ACCTID>
               <ACCTTYPE>CHECKING</ACCTTYPE>
            </BANKACCTFROM>
            <BANKTRANLIST>
               <DTSTART>20160530120000[-3:BRT]</DTSTART>
               <DTEND>20160627120000[-3:BRT]</DTEND>
               <STMTTRN>
                  <TRNTYPE>OTHER</TRNTYPE>
                  <DTPOSTED>20160601120000[-3:BRT]</DTPOSTED>
                  <TRNAMT>-10.00</TRNAMT>
                  <FITID>2016060111650</FITID>
                  <CHECKNUM>000391100701</CHECKNUM>
                  <REFNUM>391.100.701</REFNUM>
                  <MEMO>Cobrança de I.O.F.</MEMO>
               </STMTTRN>
               <STMTTRN>
                  <TRNTYPE>OTHER</TRNTYPE>
                  <DTPOSTED>20160602120000[-3:BRT]</DTPOSTED>
                  <TRNAMT>880.00</TRNAMT>
                  <FITID>2016060202176000</FITID>
                  <CHECKNUM>000000121482</CHECKNUM>
                  <REFNUM>121.482</REFNUM>
                  <MEMO>Recebimento de Proventos</MEMO>
               </STMTTRN>
               <STMTTRN>
                  <TRNTYPE>OTHER</TRNTYPE>
                  <DTPOSTED>20160602120000[-3:BRT]</DTPOSTED>
                  <TRNAMT>-400.00</TRNAMT>
                  <FITID>201606021400000</FITID>
                  <CHECKNUM>756571580758</CHECKNUM>
                  <REFNUM>21.756.571.580.758</REFNUM>
                  <MEMO>Saque no TAA - 02/06 17:56</MEMO>
               </STMTTRN>
               <STMTTRN>
                  <TRNTYPE>OTHER</TRNTYPE>
                  <DTPOSTED>20160603120000[-3:BRT]</DTPOSTED>
                  <TRNAMT>120.00</TRNAMT>
                  <FITID>201606030137050</FITID>
                  <CHECKNUM>000000000123</CHECKNUM>
                  <REFNUM>123</REFNUM>
                  <MEMO>Recebimento Fornecedor</MEMO>
               </STMTTRN>
               <STMTTRN>
                  <TRNTYPE>OTHER</TRNTYPE>
                  <DTPOSTED>20160603120000[-3:BRT]</DTPOSTED>
                  <TRNAMT>-200.00</TRNAMT>
                  <FITID>20160603149980</FITID>
                  <CHECKNUM>000000141658</CHECKNUM>
                  <REFNUM>141.658</REFNUM>
                  <MEMO>Compra com Cartão - 03/06 11:34 LOJAS X</MEMO>
               </STMTTRN>
               <STMTTRN>
                  <TRNTYPE>OTHER</TRNTYPE>
                  <DTPOSTED>20160603120000[-3:BRT]</DTPOSTED>
                  <TRNAMT>-200.00</TRNAMT>
                  <FITID>201606031433760</FITID>
                  <CHECKNUM>00010758</CHECKNUM>
                  <REFNUM>141.658</REFNUM>
                  <MEMO>Pagto cartão crédito</MEMO>
               </STMTTRN>
               <STMTTRN>
                  <TRNTYPE>OTHER</TRNTYPE>
                  <DTPOSTED>20160603120000[-3:BRT]</DTPOSTED>
                  <TRNAMT>-200.00</TRNAMT>
                  <FITID>201606031433760</FITID>
                  <CHECKNUM>00010758</CHECKNUM>
                  <REFNUM>141.658</REFNUM>
                  <MEMO>Pagto cartão crédito</MEMO>
               </STMTTRN>
            </BANKTRANLIST>
            <LEDGERBAL>
               <BALAMT>-10.00</BALAMT>
               <DTASOF>20160627120000[-3:BRT]</DTASOF>
            </LEDGERBAL>
         </STMTRS>
      </STMTTRNRS>
   </BANKMSGSRSV1>
</OFX>
patrickdalla commented 7 months ago

It seems that the parser ignores the Charset info in OFX header. As windows-1252 is default charset for JVM in windows SO, it works perfectly for you. I run on Linux, on which JVM uses UTF-8 as default charset.

gfd2020 commented 7 months ago

It seems that the parser ignores the Charset info in OFX header. As windows-1252 is default charset for JVM in windows SO, it works perfectly for you. I run on Linux, on which JVM uses UTF-8 as default charset.

Thanks. Maybe I can get the JVM's default charset and pass it to the InputStreamReader...

wladimirleite commented 7 months ago

Thanks. Maybe I can get the JVM's default charset and pass it to the InputStreamReader...

I guess you need to get the charset from the file header, right?

gfd2020 commented 7 months ago

Thanks. Maybe I can get the JVM's default charset and pass it to the InputStreamReader...

I guess you need to get the charset from the file header, right?

In fact I would get the JVM's default charset...

https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/nio/charset/Charset.html#defaultCharset()

patrickdalla commented 7 months ago

wladmir is right. the charset is already taken from default jvm config, but it must be configured with the specific ofx header

Em seg., 8 de jan. de 2024 16:36, gfd2020 @.***> escreveu:

Thanks. Maybe I can get the JVM's default charset and pass it to the InputStreamReader...

I guess you need to get the charset from the file header, right?

In fact I would get the JVM's default charset...

https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/nio/charset/Charset.html#defaultCharset()

— Reply to this email directly, view it on GitHub https://github.com/sepinf-inc/IPED/pull/1908#issuecomment-1881782242, or unsubscribe https://github.com/notifications/unsubscribe-auth/AG247S24AEDTCUJSRBGAGULYNRKELAVCNFSM6AAAAAA5MZBAHGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQOBRG44DEMRUGI . You are receiving this because you were mentioned.Message ID: @.***>

patrickdalla commented 7 months ago

The OFX library reads this headers with processOFXv2Headers and processOFXv1Headers, but doen't make any operation with these informations.

patrickdalla commented 7 months ago

In OFC sample file, the charset encoding is in CPAGE tag.

<OFC>
   <DTD>2</DTD>
   <CPAGE>1252</CPAGE>
   <ACCTSTMT>
      <ACCTFROM>
         <BANKID>001</BANKID>
         <BRANCHID>1234-13</BRANCHID>
         <ACCTID>00000543219</ACCTID>
         <ACCTTYPE>0</ACCTTYPE>
      </ACCTFROM>
      <STMTRS>
         <DTSTART>20160530</DTSTART>
         <DTEND>20160627</DTEND>
         <LEDGER>-10.00</LEDGER>
         <STMTTRN>
            <TRNTYPE>1</TRNTYPE>
            <DTPOSTED>20160601</DTPOSTED>
            <TRNAMT>-10.00</TRNAMT>
            <FITID>2016060111650</FITID>
            <CHKNUM>91100701</CHKNUM>
            <MEMO>Cobran�a de I.O.F.</MEMO>
         </STMTTRN>
         <STMTTRN>
            <TRNTYPE>0</TRNTYPE>
            <DTPOSTED>20160602</DTPOSTED>
            <TRNAMT>880.00</TRNAMT>
            <FITID>2016060202176000</FITID>
            <CHKNUM>00121482</CHKNUM>
            <MEMO>Recebimento de Proventos</MEMO>
         </STMTTRN>
         <STMTTRN>
            <TRNTYPE>1</TRNTYPE>
            <DTPOSTED>20160602</DTPOSTED>
            <TRNAMT>-400.00</TRNAMT>
            <FITID>201606021400000</FITID>
            <CHKNUM>71580758</CHKNUM>
            <MEMO>Saque no TAA - 02/06 17:56</MEMO>
         </STMTTRN>
         <STMTTRN>
            <TRNTYPE>0</TRNTYPE>
            <DTPOSTED>20160603</DTPOSTED>
            <TRNAMT>120.00</TRNAMT>
            <FITID>201606030137050</FITID>
            <CHKNUM>00000123</CHKNUM>
            <MEMO>Recebimento Fornecedor</MEMO>
         </STMTTRN>
         <STMTTRN>
            <TRNTYPE>1</TRNTYPE>
            <DTPOSTED>20160603</DTPOSTED>
            <TRNAMT>-200.00</TRNAMT>
            <FITID>20160603149980</FITID>
            <CHKNUM>00141658</CHKNUM>
            <MEMO>Compra com Cart�o - 03/06 11:34 LOJAS X</MEMO>
         </STMTTRN>
         <STMTTRN>
            <TRNTYPE>1</TRNTYPE>
            <DTPOSTED>20160603</DTPOSTED>
            <TRNAMT>-200.00</TRNAMT>
            <FITID>201606031433760</FITID>
            <CHKNUM>00010758</CHKNUM>
            <MEMO>Pagto cart�o cr�dito</MEMO>
         </STMTTRN>
         <STMTTRN>
            <TRNTYPE>1</TRNTYPE>
            <DTPOSTED>20160605</DTPOSTED>
            <TRNAMT>-200.00</TRNAMT>
            <FITID>201606031433760</FITID>
            <CHKNUM>00010758</CHKNUM>
            <MEMO>Pagto cart�o cr�dito</MEMO>
         </STMTTRN>
      </STMTRS>
   </ACCTSTMT>
</OFC>
patrickdalla commented 7 months ago

This is a better approach: implement a findCharset method for both OFC and OFX.

The OFC version would be:


    public Charset findCharset(File file) throws IOException {
        /* discover charset */
        FileInputStream inputStream = new FileInputStream(file);
        Reader reader = new InputStreamReader(inputStream);
        BufferedReader rd = new BufferedReader(reader);
        Pattern pattern = Pattern.compile("\\<CPAGE\\>(.*)\\<\\/CPAGE\\>");
        Matcher matcher = pattern.matcher("\\D");

        Charset result = Charset.defaultCharset();

        String line = null;
        while ((line = rd.readLine()) != null) {
            matcher.reset(line);
            if (matcher.find()) {
                String cpage = matcher.group(1);
                try {
                    return Charset.forName(cpage);
                } catch (Exception e) {
                    try {
                        return Charset.forName("windows-" + cpage);
                    } catch (Exception e2) {
                        // TODO: handle exception
                    }
                }
            }
        }

        return result;
    }

while the OFX:

    public Charset findCharset(File file) throws IOException {
        /* discover charset */
        FileInputStream inputStream = new FileInputStream(file);
        Reader reader = new InputStreamReader(inputStream);
        BufferedReader rd = new BufferedReader(reader);
        Pattern pattern = Pattern.compile("^CHARSET\\:(.*)");
        Matcher matcher = pattern.matcher("\\D");

        Charset result = Charset.defaultCharset();

        String line = null;
        while ((line = rd.readLine()) != null) {
            matcher.reset(line);
            if (matcher.find()) {
                String cpage = matcher.group(1);
                try {
                    return Charset.forName(cpage);
                } catch (Exception e) {
                    try {
                        return Charset.forName("windows-" + cpage);
                    } catch (Exception e2) {
                        // TODO: handle exception
                    }
                }
            }
        }

        return result;
    }
gfd2020 commented 7 months ago

This is a better approach: implement a findCharset method for both OFC and OFX.

Thanks @patrickdalla . I hadn't really understood the charset issue. After you explained @wladimirleite comment, I understood. I was going to do a similar method to what you did. I believe that the parser is now correct...

patrickdalla commented 7 months ago

@gfd2020 , I think I pushed correctly the commit to your repo. Please, check it, and make some last tests. After that, I think it will be enough to approve.

gfd2020 commented 7 months ago

@gfd2020 , I think I pushed correctly the commit to your repo. Please, check it, and make some last tests. After that, I think it will be enough to approve.

Hi @patrickdalla , I did the tests here and it seems to be ready. I just made a small update to handle the OFX type V2 encoding.

gfd2020 commented 7 months ago

Approved after adding warn of unimplemented investment section and internationalization. I would appreciate the review of the Portuguese translation of the used strings, @gfd2020 .

I think the translation is ok. I don't know if there is an official translation, the OFX documentation is all in English. What I can do is see what is most used in the examples we have.

patrickdalla commented 7 months ago

Thank you @gfd2020.

I included item file path info in parsing errors log information.

@lfcnassif, this PR is ok for me. The merge and any final revision is up to you from now on.

lfcnassif commented 7 months ago

Hi @gfd2020 and @patrickdalla, I finished my static code review. I'll try to collect samples to run and test the 2 new parsers, but I would appreciate if you could send me some. Please take a look at my changes to confirm I didn't break something accidentally.

wladimirleite commented 7 months ago

I'll try to collect samples to run and test the 2 new parsers, but I would appreciate if you could send me some.

In a quick search here, I found ~50 OFX from two different cases. I am sending you through Teams.

lfcnassif commented 7 months ago

Thank you @wladimirleite!

lfcnassif commented 7 months ago

Hi @gfd2020 and @patrickdalla! Tested this with ~400 samples crawled from past cases, including 50 @wladimirleite sent me. I pushed the last 2 commits to fix issues I've found. I think this is good now, so I'm merging.

A possible improvement would be to extract each single money transaction as a separate metadata item to populate the timeline and the time chart, but I'll leave this as a future enhancement.

Thank you @gfd2020 for this PR and @patrickdalla for your review!

wladimirleite commented 6 months ago

A very minor detail, there is one message property duplicated "OFXParser.TransactionEndDate". I am removing it in another PR (WhatsApp Enhancements), as I am editing these property files there.

wladimirleite commented 5 months ago

Tested this with ~400 samples crawled from past cases, including 50 @wladimirleite sent me.

I found 2 OFX files in a case I am working on, but noticed that they were not recognized as OFX. Testing with the 50 samples I sent when this PR was being tested, only 2 of them were identified as OFX. It seems that there is a newline before "OFXHEADER:100", but the defined file signature expects it in offset 0. @lfcnassif, can you check if your samples are processed correctly?

lfcnassif commented 5 months ago

Sure!

lfcnassif commented 5 months ago

Hi @wladimirleite, you're right! From 333 OFX v1, 129 weren't being detected properly because of the starting newline. Detection based on extension wasn't working because OFX v1 should sub class text/plain. Just pushed the fix to master. Fortunately, no new exceptions were thrown on the new detected files and seems they were parsed fine. Thanks for warning about this!