joxeankoret / diaphora

Diaphora, the most advanced Free and Open Source program diffing tool.
http://diaphora.re
GNU Affero General Public License v3.0
3.58k stars 371 forks source link

Diaphora only imports symbols inside instructions only 50% of the time #227

Closed Revester closed 1 year ago

Revester commented 2 years ago

When I import symbols, Diaphora only imports it 50% of the time.

https://user-images.githubusercontent.com/59390647/147412728-a035b0c1-1ef6-4501-8325-88dd83ca3dce.mp4

Recorded a video for clarity. As you can see, I have to import 2 times after the first time in order for it to work. And it will always not work next time after it imported.

I managed to track it with logs, and managed to fix it, I think. When I change the line:

https://github.com/joxeankoret/diaphora/blob/master/diaphora_ida.py#L1410

From is_importable = self.row_is_importable(ea2, import_syms) to is_importable = self.row_is_importable(ea1, import_syms). Somehow the ea1 and ea2 get flipped in the function(there are a lot of things this function does), and by the time it calls row_is_importable, the ea2 variable points to the current effective address in the current database, not the getting-imported effective address from the previous database. I don't know what flips the ea1 and ea2, but after I pass ea1 to the is_importable, it always imports the symbols like comments:

https://user-images.githubusercontent.com/59390647/147413113-736a91cc-28e7-45fa-b689-14a1f15b7967.mp4

When I pass ea2, is_importable is always false in my experience for all the symbols, which is expected because all the addresses are from the current database.

Revester commented 2 years ago

The bug is sitting in the import_instruction_level function in the SQL Query(diaphora_ida.py#L1375):

sql = """select *
                     from (
                   select assembly, assembly_addrs, 1
                     from functions
                    where address = ?
                      and assembly is not null
             union select assembly, assembly_addrs, 2
                     from diff.functions
                    where address = ?
                      and assembly is not null)
                    order by 2 asc"""

The problem is that it gets 2 lines from 2 different databases and tables: the functions table and the diff.functions table. It then saves rows from both tables into 2 variables:

          cur.execute(sql, (str(ea1), str(ea2)))
          diff_rows = cur.fetchall()
          if len(diff_rows) > 0:
            lines1 = diff_rows[0]["assembly"]
            lines2 = diff_rows[1]["assembly"]

The problem is that the rows returned aren't guaranteed to be in some specific order(from what I remember about SQL databases). At first I thought that lines1 were for the current database and lines2 were for diff database, same as ea1 and ea2, but they were flipped.

In my case in my tests the results were flipped at the end: at the start of the function ea1 points to the address in the current database, ea2 - to the diff database.

But at the end of the function:

              if changed or is_importable:
                ea1 = str(ea1)
                ea2 = str(ea2)
                if ea1 in matched_syms and ea2 in import_syms:
                  self.import_instruction(matched_syms[ea1], import_syms[ea2])
                if ea2 in matched_syms and ea1 in import_syms:
                  self.import_instruction(matched_syms[ea2], import_syms[ea1])

ea2 would point to the address in the current database, and ea1 - to the other database. So it would import only 50% of the time. I still don't understand why 50% specifically, but doing 2 different queries for 2 different databases and getting consistent results helped to fix the bug.

joxeankoret commented 1 year ago

Just FYI: I'm working on integrating the PRs you sent long ago. My apologies for taking that long.

joxeankoret commented 1 year ago

I have just integrated your pull requests manually. Thank you very much and sorry for taking so long!