jermp / sshash

A compressed, associative, exact, and weighted dictionary for k-mers.
MIT License
83 stars 17 forks source link

Support for compiling on M1/M2 Macs #22

Closed rob-p closed 1 year ago

rob-p commented 1 year ago

Hi @jermp,

Right now, compilation fails on M1 and M2 Macs for several reasons. First, the current version of Xcode doesn't support -march=native on these machines. Clang does upstream, so this will go away eventually, but it would be good for the CMakeFiles using this to have a flag one could pass to remove this flag.

Second, there are several places where x86 intrinsics are used (and at least one place where inline ASM is used). The intrinsics could be made portable with simde, and the assembly could likely be tested too (not sure if that actually causes a problem or not but I can't get to it yet because of the other intrinsics).

I think it would be worth the (hopefully) minor changes to be able to support sshash (and hence piscem) compilation directly on M1/M2 hardware!

Best, Rob

jermp commented 1 year ago

Hi @rob-p, I have recently bought a Mac mini (for my office) with an Apple M1 processor. I haven't tested SSHash there but will do on monday.

I've just created a new branch here https://github.com/jermp/sshash/tree/apple-m-1-2-support with a possible fix. Can you try to compile it now and let me know if everything goes well?

PS. We are doing the same for LPHash (https://github.com/jermp/lphash/blob/main/CMakeLists.txt#L17) and it works. I've tested LPHash on M1.

rob-p commented 1 year ago

Hi @jermp,

This does seem to compile fine on my M1 MacBook Pro! I've not had a chance to test running it yet thought but will report back here once I get to that.

--Rob

jermp commented 1 year ago

Excellent @rob-p, thanks!

rob-p commented 1 year ago

Hi @jermp,

So it starts to work, but then I get an exception.

❯ ./sshash build -i chr16_unitigs.fa -k 31 -m 19 --canonical-parsing -o chr16_sshash_idx
k = 31, m = 19, seed = 1, l = 6, c = 3, canonical_parsing = true, weighted = false
reading file 'chr16_unitigs.fa'...
m_buffer_size 29411764
read 100000 sequences, 5116928 bases, 2116958 kmers
read 200000 sequences, 11744174 bases, 5744204 kmers
read 300000 sequences, 19778101 bases, 10778131 kmers
read 400000 sequences, 28132903 bases, 16132933 kmers
read 500000 sequences, 38354108 bases, 23354138 kmers
read 600000 sequences, 51104336 bases, 33104366 kmers
read 700000 sequences, 67296063 bases, 46296093 kmers
read 800000 sequences, 83511079 bases, 59511109 kmers
sorting buffer...
saving to file './sshash.tmp.run_2658081298525333.minimizers.0.bin'...
read 880192 sequences, 96582477 bases, 70176717 kmers
num_kmers 70176717
num_super_kmers 13461698
num_pieces 880193 (+0.752551 [bits/kmer])
=== step 1: 'parse_file' 5.59785 [sec] (79.7679 [ns/kmer])
libc++abi: terminating with uncaught exception of type pthash::seed_runtime_error: seed did not work

This is with the unitigs from chr16 (computed using cuttlefish) as the input. Not sure why this might happen, but happy to dig deeper if you have any ideas.

jermp commented 1 year ago

Mmmh that's strange. Can you try to see if PTHash (https://github.com/jermp/pthash) works correctly on M1? Sorry I haven't tried it yet since I do not have an M1 processor on my laptop. It should work indeed since we are also using it on LPHash which it has been tested on M1.

That exception is related to hash collisions....so another thing to try is to use 128-bit hash codes and see if the behavior persists: https://github.com/jermp/sshash/blob/apple-m-1-2-support/include/util.hpp#L26. That being said, collisions should not happen there!

jermp commented 1 year ago

Hi @rob-p, ok I've tested PTHash on M1 and everything works fine as expected.

Compiling SSHash with cmake .. -D CMAKE_BUILD_TYPE=Debug -D SSHASH_USE_SANITIZERS=OFF and trying this test command ./sshash build -i ../data/unitigs_stitched/salmonella_enterica_k31_ust.fa.gz -k 31 -m 13 --check -o salmonella_enterica.index, I have this assert failure: Assertion failed: (taken.get(p) == false), function operator(), file search.hpp, line 252., which is indeed in the PTHash code. So the problem should be related to some collisions happening during the parsing of minimizers on M1.

Instead, if we compile with cmake .. -D CMAKE_BUILD_TYPE=Debug -D SSHASH_USE_SANITIZERS=ON (sanitizers enabled), everything works well.

The problem is only on M1 with SSHASH_USE_SANITIZERS=OFF. At least it seems to be reproducible. I'll keep you posted.

jermp commented 1 year ago

ok I've tested PTHash on M1 and everything works fine as expected.

I've investigated some more and, actually, the problem seems the parallel mode of PTHash on M1. With 1 thread all is good.

This means that the SSHash code is correct, no collisions during minimizers' calculation.

Can you confirm this by using 1 thread here https://github.com/jermp/sshash/blob/apple-m-1-2-support/include/minimizers.hpp#L17 and here https://github.com/jermp/sshash/blob/apple-m-1-2-support/include/builder/build_skew_index.hpp#L136 on your chr16 example?

Thanks Rob!

rob-p commented 1 year ago

sure! will try and report back.

rob-p commented 1 year ago

Confirmed. If I use 1 thread in both of these places, I can build the index successfully.

jermp commented 1 year ago

See also this commit https://github.com/jermp/sshash/commit/d603d2522f34db8a625b889fb9d18d9b827adbe2. In short: I'm building partitioned MPHFs on ARM. All good and we scale even better (but will have a minor slowdown in lookup).

rob-p commented 1 year ago

Cool! Any idea why it otherwise breaks on ARM?

rob-p commented 1 year ago

I had one more thought /question. Right now, we can’t distribute sshash-based code via bioconda that will run on M1/2 machines. This is because bioconda is a bit weird. Currently, they don’t have any native M1/2 build hosts, so everything is compiled as x86_64. Usually, that’s OK (not optimal, but OK), because you can just install packages in an x86_64 environment and run them under rosetta 2.

The problem is, that doesn’t work with this code. We immediately get an illegal instruction error. You can see this, e.g. if you pull down the latest piscem from bioconda.

One question I had is, what compiler flags do we actually need? For example, rosetta 2 handles many intrinsics fine (e.g. SSE4.2), but not others (e.g. AVX512). I am not certain if it handles BMI2 or not.

My thought is, if we figured out what instructions it can handle — we could perhaps add a “CONDA_BUILD” flag to the CMake scripts that would skip march=native, but would still pass the most essential performance flags to the compiler. This would produce (ideally) an x86_64 executable that may not be quite as optimized, but which would run on both x86_64 and M1/2 Macs. Let me know if you have any questions or thoughts on this.

Thanks! Rob

jermp commented 1 year ago

Cool! Any idea why it otherwise breaks on ARM?

Fixed! https://github.com/jermp/pthash/commit/348fe1d5d32bc124369fd372d3a8b6bea4d66d05

Already integrated on LPHash. I'm doing the same for SSHash and supersede https://github.com/jermp/sshash/commit/d603d2522f34db8a625b889fb9d18d9b827adbe2.

jermp commented 1 year ago

Ok, done. See the branch https://github.com/jermp/sshash/tree/apple-m-1-2-support. All good on my end. Can you confirm it works on yours as well? Thanks Rob!

jermp commented 1 year ago

Oh, by the way, I'm about to also support larger k, up to 63. (Right now, under dev.)

I think it will become increasingly useful to support k > 31. The space for SSHash will also become way smaller since, for fixed m, the number of super-k-mers will decrease. Results are encouraging: https://github.com/jermp/sshash/issues/21#issuecomment-1407689873.

Happy to know what you think!

jermp commented 1 year ago

The problem is, that doesn’t work with this code. We immediately get an illegal instruction error. You can see this, e.g. if you pull down the latest piscem from bioconda.

One question I had is, what compiler flags do we actually need? For example, rosetta 2 handles many intrinsics fine (e.g. SSE4.2), but not others (e.g. AVX512). I am not certain if it handles BMI2 or not.

My thought is, if we figured out what instructions it can handle — we could perhaps add a “CONDA_BUILD” flag to the CMake scripts that would skip march=native, but would still pass the most essential performance flags to the compiler. This would produce (ideally) an x86_64 executable that may not be quite as optimized, but which would run on both x86_64 and M1/2 Macs. Let me know if you have any questions or thoughts on this.

Yeah, we should get an idea of what instructions are not permitted there...but anyway I do not think a x86 binary would execute well on ARM.

Ok, we are mixing two separate (although related) things in this thread:

  1. support SSHash on ARM (which, now, it should be done!);
  2. distribute SSHash via bioconda.

For point 2. I do not know since I haven't used bioconda and I'm not familiar with that. But if they do not have any M1/M2 build hosts...then it's not SSHash's (hence, nor our) fault :D Eventually they should get some building environment, I suppose, no? For the moment, we could just warn the users about this bioconda's limitation and ask them to download SSHash directly from GitHub and compile it manually (which is trivial).

rob-p commented 1 year ago

Hi @jermp,

I agree, these are 2 separate issues. I'm opening the Bioconda one in another issue (since it's not really the same as this). Any thoughts on why the "normal" parallelization doesn't work under M1/2 though?

Thanks! Rob

jermp commented 1 year ago

Hi @rob-p , it works now :) See comments above!

jermp commented 1 year ago

Support for ARM has been added. I'm closing this @rob-p but feel free to reopen it.