raspberrypi / firmware

This repository contains pre-compiled binaries of the current Raspberry Pi kernel and modules, userspace libraries, and bootloader/GPU firmware.

AArch64 support #550

Closed grigorig closed 4 years ago

grigorig commented 8 years ago

Is there a chance of AArch64 builds of the userspace and kernel? What's missing to get this to work?

clivem commented 8 years ago

kernel support!

popcornmix commented 8 years ago

This isn't going to happen from us any time soon. A 64-bit kernel is not trivial (and could be produced by community).

deborah-c commented 8 years ago

Could it be produced by community? I think it might well need changes to the VC firmware to correspond, as interface structures would potentially change shape

popcornmix commented 8 years ago

The kernel could be. Depends on the implementation if the interface to VC needs to change. Forcing 32-bit pointers in interface to VC would be a sensible solution that wouldn't need a VC side change.
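
To illustrate why interface structures would "change shape", here is a minimal sketch with a made-up message layout. The struct and field names are hypothetical and are not taken from the VCHIQ or MMAL headers. A field declared as a pointer doubles in size on a 64-bit kernel and brings alignment padding with it, while a fixed-width 32-bit field keeps the layout identical on both sides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical message layouts, for illustration only. */

/* Naive version: the size follows the ARM-side pointer width
 * (4 bytes on a 32-bit kernel, 8 bytes on a 64-bit one). */
struct msg_native {
    uint32_t type;
    void    *buffer;   /* changes size between AArch32 and AArch64 */
    uint32_t length;
};

/* Fixed-width version: every field has the same size on both kernels. */
struct msg_fixed {
    uint32_t type;
    uint32_t buffer;   /* 32-bit bus address or handle */
    uint32_t length;
};

/* Size of the fixed layout; asserts that it contains no padding. */
size_t msg_fixed_size(void)
{
    assert(sizeof(struct msg_fixed) == 3 * sizeof(uint32_t));
    return sizeof(struct msg_fixed);
}
```

On the usual ARM ABIs, `sizeof(struct msg_native)` comes out at 12 bytes on a 32-bit build and 24 on a 64-bit build, while `struct msg_fixed` stays at 12 on both.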

pelwell commented 8 years ago

MMAL is an awkward case, since it passes kernel pointers to the client and expects them to be echoed back - the space is 32-bits, so some form of compression or lookup table would be required.
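
A minimal sketch of the lookup-table idea, assuming a small fixed number of in-flight messages. The names `handle_alloc`, `handle_release` and `HANDLE_SLOTS` are hypothetical, and there is no locking here. The kernel keeps the real pointer and sends only a 32-bit handle across the interface, then resolves the handle when it is echoed back.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HANDLE_SLOTS 64u   /* hypothetical limit on in-flight messages */

/* Slot i holds the pointer behind handle i + 1; handle 0 is never issued. */
static void *handle_slots[HANDLE_SLOTS];

/* Store ptr and return a non-zero 32-bit handle, or 0 if the table is full. */
uint32_t handle_alloc(void *ptr)
{
    assert(ptr != NULL);
    for (uint32_t i = 0; i < HANDLE_SLOTS; i++) {
        if (handle_slots[i] == NULL) {
            handle_slots[i] = ptr;
            return i + 1;
        }
    }
    return 0;
}

/* Resolve an echoed handle to the stored pointer and free its slot.
 * Returns NULL for an out-of-range or already released handle. */
void *handle_release(uint32_t handle)
{
    if (handle == 0 || handle > HANDLE_SLOTS)
        return NULL;
    void *ptr = handle_slots[handle - 1];
    handle_slots[handle - 1] = NULL;
    return ptr;
}
```

The linear scan is only acceptable because the table is tiny. A real driver would more likely use an existing kernel facility such as the IDR allocator, with proper locking.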

6by9 commented 8 years ago

> MMAL is an awkward case, since it passes kernel pointers to the client and expects them to be echoed back - the space is 32-bits, so some form of compression or lookup table would be required.

Really? That doesn't sound right as kernel pointers have no meaning outside the kernel. I'm happy to take a look if you'll email me details of the bit of concern.

pelwell commented 8 years ago

Take a look here: https://github.com/raspberrypi/linux/blob/rpi-4.1.y/drivers/media/platform/bcm2835/mmal-msg.h#L259 here: https://github.com/raspberrypi/linux/blob/rpi-4.1.y/drivers/media/platform/bcm2835/mmal-vchiq.c#L424 and here: https://github.com/raspberrypi/linux/blob/rpi-4.1.y/drivers/media/platform/bcm2835/mmal-vchiq.c#L510

6by9 commented 8 years ago

Fair cop - not nice. The V4L2 driver is just copying the way MMAL did it.

How brave are we feeling? We could pull in the rpmsg mmal service instead, however that loses the bulk transfer facility so may need a slight change to the client code.

Edit: Hang on, that is the V4L2 driver only, so all kernel side. It's expecting VC to echo back a kernel pointer, not userspace. I do have some changes planned for V4L2 which may help here (GSH and DC are aware). I'll check in a moment, but does the MMAL interface to userland have this same nastiness?

6by9 commented 8 years ago

:-( Userland also expects VC to preserve a kernel pointer https://github.com/raspberrypi/userland/blob/master/interface/mmal/vc/mmal_vc_msgs.h#L360 That is just VC and kernel side, so could be updated fairly easily, but would be an ABI change between firmware and kernel (or we need the firmware to try and handle multiple different versions of structure). There's a couple of other pointers in structures passed to VC which would need attention too (eg https://github.com/raspberrypi/userland/blob/master/interface/mmal/vc/mmal_vc_msgs.h#L421)

Are the other services OK? IL had some niggles with having to set OMX_SKIP64BIT due to structure padding mismatches, but how does ILCS shape up more generally? Something will still need to reduce kernel 64bit pointers to 32 bit physicals for VC. VCSM? Mailbox services?

My memory is failing me - did we ever get a 64bit kernel running? All userspaces were certainly 32bit.

ghost commented 8 years ago

Speaking from a position of zero knowledge - how much of the Debian arm64 kernel source can be used before you run into problems?

deborah-c commented 8 years ago

At Broadcom, the intention was 64 bit user land over 32 bit kernel, as a long term thing.

TheSin- commented 8 years ago

why not 64bit across? Debian has a 64bit kernel and dist for arm. I know the Pi specific stuff would still need to be done, but why were they planning 32bit kernel? This is a curiosity question.

grigorig commented 8 years ago

32 bit kernel w/ 64 bit userland? I wasn't aware that this is a possible combination. Seems like a strange idea.

6by9 commented 8 years ago

> At Broadcom, the intention was 64 bit user land over 32 bit kernel, as a long term thing.

I'd remembered other way up - 64 bit kernel, 32 bit userland (as that was the current state of Android). I couldn't remember if that work had actually happened - did we actually have A53s in a chip that was brought up?

pelwell commented 8 years ago

64-bit user space with 32-bit kernel is not possible on ARMv8. The kernel (especially the task switching) needs to be able to access all register state used by user space, which wouldn't be possible if the kernel was in 32-bit mode. The ARMv8 architecture allows an AArch32->AArch64 transition as the result of an exception/interrupt, and AArch64->AArch32 on return from an exception; the reverse routes don't exist.

grigorig commented 8 years ago

Okay, good to see this clarified.

On x86, 64 bit kernels have some (small) performance advantages even if combined with 32 bit userspace. Maybe that's a possible motivation to get it working on Pi 3 as well.

pelwell commented 8 years ago

There must be figures out there from all of those other A53-based SBCs comparing 32-bit vs 64-bit kernels - let's see some.

Ferroin commented 8 years ago

It's worth noting that the biggest reason on x86 for a performance increase is not the wider registers, but the fact that x86_64 has more general purpose registers available, which means on average you need fewer load/store operations to do the same calculation. AArch64 however has the same required registers as AArch32, so x86 is not really a good point of comparison for the performance difference.

While I don't personally have any figures, I can attest that there is a small but noticeable performance improvement for 64-bit vs 32-bit on both SPARC and PPC with recent kernels. I've not seen figures for any 64-bit ARM processors, but I would assume there will be a similar small but noticeable performance increase there as well as the differences between 32 and 64 bit modes on SPARC/PPC are relatively similar to those on ARM when compared to the changes on x86.

That said, I think the big thing that will really make the difference stand out is the fact that the in-kernel timekeeping structures are in the process of being converted from 32-bit to 64-bit to avoid the y2038 issue. Once that hits mainline, most 32-bit systems will likely show measurably lower performance as a result.

On a slightly separate note, I seem to recall hearing something about AArch64 natively supporting use of 32-bit pointers in otherwise 64-bit code (kind of like the X32 ABI on x86, just supported directly in hardware). If that is the case, then it might make handling the issues with pointer width a bit easier (and also result in overall better memory usage).

deborah-c commented 8 years ago

Sorry, my bad: I've clearly misremembered!

grigorig commented 8 years ago

> On a slightly separate note, I seem to recall hearing something about AArch64 natively supporting use of 32-bit pointers in otherwise 64-bit code (kind of like the X32 ABI on x86, just supported directly in hardware). If that is the case, then it might make handling the issues with pointer width a bit easier (and also result in overall better memory usage).

I think it's called AArch64-ILP32. I am not sure if it is a good idea to use such an unusual ABI. No regular AArch32 or AArch64 binaries will work without a costly multilib setup.

Ferroin commented 8 years ago

We would need a multilib setup anyway for 32-bit support if we do a 64-bit version, otherwise we're actively breaking compatibility with existing systems. On the Pi, flash storage space is cheap and upgradeable, whereas RAM is not, and this situation is exactly the type of thing that such ABIs are designed for. On top of that, we don't need to worry about AArch64 compatibility, we have no established user base using it, and people are more likely to either use stuff bundled with Raspbian (or whatever other distribution) or built locally than third-party proprietary code, with the sole exception being the Oracle JDK, which isn't as critical as it was because we have much better performance now and IcedTea should run fine (and there's no hardware acceleration for Java on newer processors anyway, so using Jazelle doesn't really provide any performance improvement). Such compatibility would be nice, but should by no means be mandatory.

The big deciding factor should really be whether we can support all three ABIs at the same time (I know of no distribution on x86 that currently supports all three options there (32, 64, and x32), even though the kernel fully supports having all three operating modes on the same system), and whether the processor itself supports it (I think it's optional, but I'm not sure, I've never had the time to read the ARMv8 ABI spec).

Aside from that, my point was more that using that ABI in the kernel may allow us to avoid having to deal with pointer width issues in the kernel drivers. I'm not certain, however, that the mainline kernel fully supports it yet.

ED6E0F17 commented 8 years ago

(Upstream, not rpi-specific) Kernel ILP32 support is not fully baked, but someone is putting a lot of effort into it:

https://lkml.org/lkml/2015/12/15/737

grigorig commented 8 years ago

I'm not very convinced that going for AArch64-ILP32 is a good idea. Raspbian is stuck with an unusual ARMv6 hard-fp architecture/ABI, but it was necessary given the BCM2835 SoC. Now we have a chance of finally switching to a standard ABI, so let's do that instead of going for some questionable new ABI that doesn't really have much support upstream.

Regarding multiarch, having to support fewer architectures is always a good thing. AArch64-ILP32 would add a third architecture into the mix. And storage might be cheap, but it's not free either! Also, multiarch can actually increase RAM usage because shared libraries can't be shared if processes of multiple architectures are running at the same time. This can be a pretty big deal if large frameworks like Qt are used.

niklas88 commented 8 years ago

I wonder how big the case is for binary compatibility anyway? Most Raspberry Pi users that actually have non-repository software probably either use scripting languages like Python or have the source and shouldn't have a problem combining a system upgrade with recompiling their code. Also interestingly Go 32bit ARM executables would only need a 32bit glibc besides there being ARM64 support. So that basically leaves the people working with C, C++ without access to source code while still having a desire to upgrade.

On the other hand, many people likely do want to port code to ARM64 and would greatly benefit from the Raspberry Pi as an inexpensive ARM64 platform. So yeah, I really don't see the Raspberry Pi as depending on binary compatibility.

turnip86 commented 8 years ago

For the love of all things digital, please provide an AArch64 kernel build. Debian and Arch already have arm64 ports, and a large number of Raspberry Pi owners are using those distros already, one of the motivations being armv7 support on Pi 2. There are significant performance increases - 15% to 30% - in running AArch64 code versus AArch32 on Cortex A53:

http://www.cnx-software.com/2016/03/01/64-bit-arm-aarch64-instructions-boost-performance-by-15-to-30-compared-to-32-bit-arm-aarch32-instructions/ (pelwell, Ferroin: was this what you're looking for?)

And this does not take into account the benefits of AArch32 compared to ARMv7, like load-acquire/store-release, new VFP float and SIMD instructions, and the cryptography extensions. https://www.element14.com/community/servlet/JiveServlet/previewBody/41836-102-1-229511/ARM.Reference_Manual.pdf (page 106)

One group of users that will directly benefit from this are people who use the Pi for media and emulation. OpenELEC, OSMC and RetroPie all have separate armv6 and armv7 releases specifically to maximize performance.

Would any Raspberry-specific userland code need to be patched for this?

Ferroin commented 8 years ago

@grigorig The big things that made me think about it were:

  1. AArch64-ILP32 is intended for memory constrained systems which will never need a 64-bit address space (VC4 limits us to 4G RAM, which fits this perfectly)
  2. There were multiple comments made about certain components only using 32-bit pointers, and thus potentially needing significant work to handle properly from a regular AArch64 kernel. I was advocating it less because I want to deal with it than because I thought it might help as a starting point.

@turnip86 There would likely be some significant code changes needed. From what I understand based on discussion both here and elsewhere, some of the hardware components only deal in 32-bit pointers, and handling that sanely will take some work, not only in the vc binaries, but likely also in most of the third-party stuff that uses hardware acceleration.

MrTomasz commented 8 years ago

Maybe let's try first running proper kernel in AArch64 mode?

I already did a bunch of work to try to boot it, but I still can't see the kernel booting on the UART... As I mentioned on the forums, it's not a simple "make ARCH=arm64 defconfig Image" one-shot...

Anyone working on 64bit kernel as well?

TheSin- commented 8 years ago

I currently am, but nowhere near ready for a test boot yet. And my Pi3 doesn't arrive for weeks yet, sadly.

MrTomasz commented 8 years ago

@TheSin- How far along are you with changes compared to the vanilla arm64 kernel? Could you contact me?

TheSin- commented 8 years ago

Once I get a full build sure, but I'm sure I won't be the first or fastest source, there are ppl here much stronger at this stuff than me. I'm just using the debian build system with cross compiling ATM, and I haven't made it very far cause I'm still messing with the defines for the .config build.

Not to mention I'm still working with 4.1 which debian no longer supports, been thinking about jumping to 4.4 but debian testing is only on 4.3, so lots to decide still. And I have no idea how stable the 4.3 and/or 4.4 branches are here. I assume everything in the 4.1 branch gets back ports to the other branches, but haven't looked into it. Though I'm sure a newer kernel would be easier to work with for arm64.

TheSin- commented 8 years ago

okay so with the 4.1 tree and the debian build system I've finally got config and such working I believe, but now I'm at my first VC issue, this is where things are gonna get icky for me anyhow as I assume we are going to have to convert everything to force 32bit integers.

/root/rpi/linux-4.1/linux/drivers/char/broadcom/vc_cma/vc_cma.c:462:11: error: initialization from incompatible pointer type [-Werror=incompatible-pointer-types]
  .write = vc_cma_proc_write,
           ^
/root/rpi/linux-4.1/linux/drivers/char/broadcom/vc_cma/vc_cma.c:462:11: note: (near initialization for ‘vc_cma_proc_fops.write’)
/root/rpi/linux-4.1/linux/drivers/char/broadcom/vc_cma/vc_cma.c: In function ‘vc_cma_alloc_chunks’:
/root/rpi/linux-4.1/linux/drivers/char/broadcom/vc_cma/vc_cma.c:584:3: error: implicit declaration of function ‘dmac_flush_range’ [-Werror=implicit-function-declaration]
   dmac_flush_range(chunk_addr, chunk_addr + chunk_size);
   ^
/root/rpi/linux-4.1/linux/drivers/char/broadcom/vc_cma/vc_cma.c:585:3: error: implicit declaration of function ‘outer_inv_range’ [-Werror=implicit-function-declaration]
   outer_inv_range(__pa(chunk_addr), __pa(chunk_addr) +
   ^
/root/rpi/linux-4.1/linux/drivers/char/broadcom/vc_cma/vc_cma.c: In function ‘cma_worker_proc’:
/root/rpi/linux-4.1/linux/drivers/char/broadcom/vc_cma/vc_cma.c:651:7: error: cast from pointer to integer of different size [-Werror=pointer-to-int-cast]
   if ((unsigned int)msg >= VC_CMA_MSG_MAX) {
       ^
/root/rpi/linux-4.1/linux/drivers/char/broadcom/vc_cma/vc_cma.c:658:11: error: cast from pointer to integer of different size [-Werror=pointer-to-int-cast]
    type = (int)msg;
           ^
In file included from /root/rpi/linux-4.1/linux/include/linux/printk.h:6:0,
                 from /root/rpi/linux-4.1/linux/include/linux/kernel.h:13,
                 from /root/rpi/linux-4.1/linux/drivers/char/broadcom/vc_cma/vc_cma.c:34:
/root/rpi/linux-4.1/linux/include/linux/kern_levels.h:4:18: error: format ‘%d’ expects argument of type ‘int’, but argument 3 has type ‘long unsigned int’ [-Werror=format=]
 #define KERN_SOH "\001"  /* ASCII Start Of Header */
                  ^
/root/rpi/linux-4.1/linux/include/linux/kern_levels.h:10:18: note: in expansion of macro ‘KERN_SOH’
 #define KERN_ERR KERN_SOH "3" /* error conditions */
                  ^
/root/rpi/linux-4.1/linux/drivers/char/broadcom/vc_cma/vc_cma.c:64:9: note: in expansion of macro ‘KERN_ERR’
  printk(KERN_ERR fmt "\n", ##__VA_ARGS__)
         ^
/root/rpi/linux-4.1/linux/drivers/char/broadcom/vc_cma/vc_cma.c:678:6: note: in expansion of macro ‘LOG_ERR’
      LOG_ERR
      ^
/root/rpi/linux-4.1/linux/drivers/char/broadcom/vc_cma/vc_cma.c:732:12: error: cast from pointer to integer of different size [-Werror=pointer-to-int-cast]
            (unsigned int)page);
            ^
/root/rpi/linux-4.1/linux/drivers/char/broadcom/vc_cma/vc_cma.c:64:30: note: in definition of macro ‘LOG_ERR’
  printk(KERN_ERR fmt "\n", ##__VA_ARGS__)
                              ^

Should I make a PR on the linux tree for the Kconfig changes? I'm mostly just reusing the 2709 stuff for now, since I don't have a 2710 to get more specific, I'd just like to be able to build to start, I know the VC stuff is going to take some time and planning but we all have to start someplace ;)
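
The `-Wpointer-to-int-cast` and `-Wformat` errors above come from code that assumes a pointer and an `unsigned int` are the same width. Below is a minimal userspace sketch of the usual fix, assuming the value hidden in the pointer is a small message code. `decode_msg`, `format_size` and `MSG_MAX` are hypothetical stand-ins, not the actual `vc_cma.c` code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define MSG_MAX 4u   /* hypothetical stand-in for VC_CMA_MSG_MAX */

/* Decode a small message code carried in a pointer-sized slot.
 * Returns the code, or -1 if the value is out of range. */
int decode_msg(const void *msg)
{
    /* uintptr_t is as wide as a pointer on both AArch32 and AArch64,
     * so this cast does not trigger -Wpointer-to-int-cast. */
    uintptr_t raw = (uintptr_t)msg;

    if (raw >= MSG_MAX)
        return -1;
    return (int)raw;   /* safe: the range check proved raw fits in an int */
}

/* Format a size with a specifier that matches unsigned long,
 * whatever its width on the target. */
int format_size(char *out, size_t out_len, unsigned long size)
{
    assert(out != NULL && out_len > 0);
    return snprintf(out, out_len, "size=%lu", size);
}
```

The same pattern applies wherever a pointer is compared with or converted to an integer: widen to `uintptr_t` (or `unsigned long` in kernel code) first, and narrow only after a range check.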

MrTomasz commented 8 years ago

You can first try disabling those kinds of things. I believe it should boot with a minimal subset of things...

Remember also to disable EFI in the config, otherwise you will create an incompatible kernel binary.

TheSin- commented 8 years ago

yeah I just wanted to try with the VC stuff to start see how far I can make it. And as for EFI it's all set the same as my rpi and rpi2 builds. Anyhow trying it now with VC stuff disabled.

TheSin- commented 8 years ago

okay disabled VC stuff to try and get a little further, I'm now stuck with

/tmp/ccyU2uoV.s: Assembler messages:
/tmp/ccyU2uoV.s:117: Error: missing immediate expression at operand 1 -- `dsb '
/tmp/ccyU2uoV.s:199: Error: missing immediate expression at operand 1 -- `dsb '
/tmp/ccyU2uoV.s:297: Error: missing immediate expression at operand 1 -- `dsb '
/root/rpi/linux-4.1/linux/scripts/Makefile.build:258: recipe for target 'drivers/dma/bcm2708-dmaengine.o' failed
make[7]: *** [drivers/dma/bcm2708-dmaengine.o] Error 1

Seems like pretty much all the RPI stuff is going to have issues of some sort. Asm is not my thing so I'm going to have to skip that I assume.
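
For context on the assembler error: the 32-bit ARM assembler accepts a bare `dsb`, but in AArch64 the instruction takes a mandatory option operand such as `sy`. A minimal sketch of a barrier helper that assembles on both, assuming GCC or Clang inline assembly; `data_sync_barrier` and `publish` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Full-system data synchronization barrier that builds on either target. */
static inline void data_sync_barrier(void)
{
#if defined(__aarch64__)
    /* A64 requires an explicit option; "sy" covers the full system. */
    __asm__ volatile("dsb sy" ::: "memory");
#elif defined(__arm__) && defined(__ARM_ARCH) && __ARM_ARCH >= 7
    /* The 32-bit assembler accepts a bare "dsb", which defaults to "sy". */
    __asm__ volatile("dsb" ::: "memory");
#else
    /* Other hosts: fall back to the compiler's full memory barrier builtin. */
    __sync_synchronize();
#endif
}

/* Write a payload, then a ready flag, with the barrier in between. */
void publish(int *payload, volatile int *ready, int value)
{
    assert(payload != NULL && ready != NULL);
    *payload = value;
    data_sync_barrier();
    *ready = 1;
}
```

In-kernel code would normally use the architecture's own barrier macros rather than raw inline assembly, so the driver fix is more likely a switch to those than a hand-written instruction.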

MrTomasz commented 8 years ago

I don't have my code with me right now, but if I remember correctly, you're building with CONFIG_DMA_BCM2708_LEGACY=y which, as I understand it, is wrong for BCM2709 (and 2710).

I did it in this way:

config DMA_BCM2708_LEGACY
    bool "BCM2708 DMA legacy API support"
    depends on (DMA_BCM2708 && !ARCH_BCM2710)
    default y

TheSin- commented 8 years ago

nice i'll try that thanks, I'm using 2709 as a base

madscientist42 commented 8 years ago

How're things coming along on this? Many are waiting with bated breath on the people trying right now (no sense in a bunch of duplicated efforts...)

popcornmix commented 8 years ago

Very impressive progress here. In the last week there has been: a 64-bit demo with UART output; a 64-bit port of U-Boot; and a 64-bit upstream kernel (single core only, and no GPU features).

madscientist42 commented 8 years ago

Epic. I'll need to pop over there to grab the work ongoing so that I can get a rough-cut for OE metadata there going. :D

swarren commented 8 years ago

Should this be closed now? Per the 3rd comment here, the Pi Foundation is going to leave 64-bit kernel support to the community, and besides, that aspect should probably be covered by a bug against the kernel git, not the firmware git. The firmware does now support 64-bit booting, and any remaining issues re: that feature are covered by issue #579.

Ruffio commented 8 years ago

Should this be closed?

xcvista commented 8 years ago

I wonder if this method can solve this 32-bit pointer issue:

popcornmix commented 8 years ago

I assume this scheme wouldn't help Mongo DB which I believe maps the whole database file ( > 4GB) to virtual RAM. That's the only example I've seen reported as requiring a 64-bit address space to run.

But yes, if virtual and physical address spaces are limited to 32-bit then that would avoid the issue of pointers (e.g. userdata/cookies) being returned to applications from GPU callbacks. I'm sure some would argue that is not a fully 64-bit system (although with only 1GB of physical RAM the limitation is unlikely to affect many use cases).
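
A minimal sketch of what "limited to 32-bit" means for a cookie that has to pass through a 32-bit field, assuming the address space really is capped below 4GB. The helper names are hypothetical. The check is simply whether the value survives truncation to 32 bits and widening back.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if the address survives a round trip through a 32-bit field. */
bool addr_fits_u32(uintptr_t addr)
{
    return (uintptr_t)(uint32_t)addr == addr;
}

/* Pack an address into a 32-bit cookie; callers must check the range first. */
uint32_t addr_to_cookie(uintptr_t addr)
{
    assert(addr_fits_u32(addr));
    return (uint32_t)addr;
}

/* Recover the address from a cookie echoed back by the other side. */
uintptr_t cookie_to_addr(uint32_t cookie)
{
    return (uintptr_t)cookie;
}
```

Any address at or above 4GB fails the check, which is why this scheme only works if the kernel constrains where such pointers can live.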

xcvista commented 8 years ago

@popcornmix This is almost exactly what the x32 ABI for amd64 is - 32-bit pointers for an otherwise 64-bit system. I think this can be a stop-gap method between the 32-bit only and fully 64-bit kernel.

Another method would be introducing one layer of indirection in the kernel. Whenever the userland passes a pointer to the GPU, it is caught by the kernel, put into a buffer, and a kernel pointer to the buffer is passed to the GPU instead. The kernel still has to keep itself inside the top-half "canonical address" range for this to work though, as pointers are still passed with their high bits cropped off. This can affect the efficiency of user-mode GPU calls but removes the 32-bit pointer length limit.

It seems to me that this pair fits well in the current Raspbian/Raspbian Lite release model. The first has a virtual memory size limit of 4GB but has faster graphics, better suited as a desktop system; while the latter has a full 64-bit virtual memory space but graphics can be atrociously slow, better suited as a headless server system.

popcornmix commented 8 years ago

There is an option of using 32-bit pointers globally (as a compiler default), but that precludes using standard 64-bit debian packages, so is not a favoured option.

64-bit pointers that are forced (through some kernel virtual address limiting) to only have 32 significant bits is a possibility, but doesn't fix Mongo DB.

I think the layer of indirection in the kernel<->GPU interface is probably the best option, but there may be some performance hit in the lookups. Probably not critical in general as I suspect the number of messages awaiting a response from GPU will normally be low, but there may be some situations where it gets to be a problem.

Currently we haven't seen strong evidence (e.g. benchmarks) that show there will be a noticeable performance improvement when moving to 64-bit, so it's unlikely to become a default configuration for raspbian and hence not a very high priority. We'd certainly like to support it for users who are interested, so suggestions for good ways to solve it are welcome.

xcvista commented 8 years ago

@popcornmix Both the limited pointer solution and the GPU trapping solution allow the use of standard Debian packages, and the trade-off is virtual memory space versus graphics performance. I think this should be a choice up to the user to make.

A 64-bit processor can handle SHA512 (as well as its friends SHA384, SHA512/224 and SHA512/256) much faster than a 32-bit one as the internal states, being 64 bits long, can fit in registers natively. Also AArch64 has more registers than AArch32, allowing for more aggressive optimizations.
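
As a small illustration of the SHA512 point, here is one of the 64-bit word functions from the SHA-512 specification (FIPS 180-4), written as a standalone sketch rather than as part of a complete hash. On a 64-bit target each rotate works on a single register, whereas a 32-bit target has to synthesize it from a pair of registers.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 64-bit word right by n bits, for 0 < n < 64. */
static inline uint64_t rotr64(uint64_t x, unsigned n)
{
    assert(n > 0 && n < 64);
    return (x >> n) | (x << (64 - n));
}

/* SHA-512 "big sigma 0": ROTR 28 xor ROTR 34 xor ROTR 39. */
uint64_t sha512_big_sigma0(uint64_t x)
{
    return rotr64(x, 28) ^ rotr64(x, 34) ^ rotr64(x, 39);
}
```

How much this matters in practice depends on the compiler and the rest of the workload, so it is not a substitute for the benchmarks asked for earlier in the thread.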

cleverca22 commented 8 years ago

would it be possible to do both?

only use half of the 64 bits for any app dealing with the gpu

but use the full 64 bits for non-gpu things like mongodb?

xcvista commented 8 years ago

@cleverca22 Then how do you tell them apart? What if a program that has already claimed a memory block out of the canonical memory block suddenly starts to call the GPU?

cleverca22 commented 8 years ago

only thing i can think of there is a flag in the ELF headers that you set at compile time, to promise to never do GPU calls

though now that i think of it, you could also modify the userland, to just use mmap() to create a secondary heap in the lower 4gig of the userland?

xcvista commented 8 years ago

@cleverca22 There is a MAP_32BIT flag in mmap(2) for AMD64. Maybe we can implement this for AArch64? Usual malloc(3) does not have a virtual memory location promise (and can go over 2GB) but mmap(2) with MAP_32BIT guarantees a sub-2GB address range.
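
A minimal sketch of the secondary low heap idea using `mmap(2)`. `MAP_32BIT` is an x86-64 Linux flag and, to my knowledge, is not defined for AArch64, so the helper below (the hypothetical `map_low`) compiles everywhere but simply returns NULL where the flag is missing.

```c
#define _GNU_SOURCE   /* exposes MAP_ANONYMOUS and, on x86-64, MAP_32BIT */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Map len bytes of anonymous memory in the low address range.
 * Returns NULL if the platform has no MAP_32BIT or the mapping fails. */
void *map_low(size_t len)
{
    assert(len > 0);
#ifdef MAP_32BIT
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);
    return p == MAP_FAILED ? NULL : p;
#else
    /* No MAP_32BIT on this platform, so there is no guarantee to offer. */
    return NULL;
#endif
}
```

An allocator for GPU-visible buffers could carve its blocks out of such a region, leaving ordinary `malloc(3)` free to use the full address space.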